Details
Type: Bug
Priority: Minor
Status: Resolved
Resolution: Fixed
Version: 0.13.0
Description
When calling pa.Array.from_pandas with primitive data as input and casting to string with type=pa.string(), the resulting pyarrow Array can have inconsistent values. For most input types the result is an empty string, but for some types (int32, int64) the values are raw bytes such as '\x01'.
In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)

In [9]: pa.Array.from_pandas(s, type=pa.string())
Out[9]:
<pyarrow.lib.StringArray object at 0x7f90b6091a48>
[
  "",
  "",
  ""
]

In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)

In [11]: pa.Array.from_pandas(s, type=pa.string())
Out[11]:
<pyarrow.lib.StringArray object at 0x7f9097efca48>
[
  "",
  "",
  ""
]
This came up in the Spark discussion https://github.com/apache/spark/pull/24930/files#r296187903. Type casting this way is not supported in Spark, but it would be good to make the behavior consistent. Would it be better to raise an UnsupportedOperation error?