  1. Apache Arrow
  2. ARROW-5682

[Python] from_pandas conversion casts values to string inconsistently

Details

    Description

      When calling pa.Array.from_pandas with primitive data as input and casting to string with "type=pa.string()", the resulting pyarrow Array can contain inconsistent values. For most input types the result is an empty string, but for some types (int32, int64) the values are '\x01', etc.

      In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)
      
      In [9]: pa.Array.from_pandas(s, type=pa.string())                                                                            
      Out[9]: 
      <pyarrow.lib.StringArray object at 0x7f90b6091a48>
      [
        "",
        "",
        ""
      ]
      
      In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)                                                                           
      
      In [11]: pa.Array.from_pandas(s, type=pa.string())                                                                           
      Out[11]: 
      <pyarrow.lib.StringArray object at 0x7f9097efca48>
      [
        "",
        "",
        ""
      ]
      

      This came from the Spark discussion https://github.com/apache/spark/pull/24930/files#r296187903. Type casting this way in Spark is not supported, but it would be good to get the behavior consistent. Would it be better to raise an UnsupportedOperation error?
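
      One way to make the behavior consistent on the caller's side, until the library itself rejects the cast, is to check the dtype before conversion. The following is only a sketch of that idea: the helper names require_string_like and to_string_array are hypothetical and not part of pyarrow, and the choice of TypeError (rather than an Arrow-specific exception) is an assumption.

```python
def require_string_like(series):
    """Return `series` unchanged, or raise TypeError if it is not string-like.

    Hypothetical helper (not a pyarrow API). It only looks at the NumPy
    kind code of the dtype: "O" for object, "U" for unicode. Everything
    else ("i", "u", "f", "b", ...) is treated as primitive data.
    """
    kind = series.dtype.kind
    if kind not in ("O", "U"):
        raise TypeError(
            f"refusing to reinterpret dtype {series.dtype!r} as string; "
            "convert explicitly first, e.g. series.astype(str)"
        )
    return series


def to_string_array(series):
    """Hypothetical wrapper: validate the dtype, then build a string array."""
    checked = require_string_like(series)
    # Imported after the check so the validation itself needs no pyarrow.
    import pyarrow as pa

    return pa.Array.from_pandas(checked, type=pa.string())
```

      With this in place, a call such as to_string_array(pd.Series([1, 2, 3], dtype=np.uint8)) fails with a TypeError instead of returning empty or garbled strings, while to_string_array(s.astype(str)) converts the values explicitly before handing them to pyarrow.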

            People

              jorisvandenbossche Joris Van den Bossche
              bryanc Bryan Cutler

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 20m