  1. Apache Arrow
  2. ARROW-5655

[Python] Table.from_pydict/from_arrays not using types in specified schema correctly


Details

    Description

      Example with from_pydict (from https://github.com/apache/arrow/pull/4601#issuecomment-503676534):

      In [15]: table = pa.Table.from_pydict(
          ...:     {'a': [1, 2, 3], 'b': [3, 4, 5]},
          ...:     schema=pa.schema([('a', pa.int64()), ('c', pa.int32())]))
      
      In [16]: table
      Out[16]: 
      pyarrow.Table
      a: int64
      c: int32
      
      In [17]: table.to_pandas()
      Out[17]: 
         a  c
      0  1  3
      1  2  0
      2  3  4
      

      Note that the specified schema 1) uses a different column name ('c' where the dictionary has 'b') and 2) specifies a non-default type (int32 instead of the inferred int64), which leads to corrupted values.

      This is partly because Table.from_pydict does not use the type information in the schema when converting the dictionary items to pyarrow arrays. In addition, Table.from_arrays does not correctly cast the arrays to another type when the schema specifies one.

      An additional question for Table.from_pydict is whether it should map the 'b' key from the dictionary onto column 'c' as defined in the schema at all (this behaviour depends on the order of the dictionary, which is not guaranteed before Python 3.7; in CPython 3.6 it is only an implementation detail).


          People

            kszucs Krisztian Szucs
            jorisvandenbossche Joris Van den Bossche
            Votes: 0
            Watchers: 2


              Time Tracking

              Original Estimate: Not Specified
              Remaining Estimate: 0h
              Time Spent: 1h 40m
