Details
Type: Bug
Priority: Minor
Status: Resolved
Resolution: Fixed
Version: 0.13.0
Description
When calling pa.Array.from_pandas with primitive data as input and casting to string with type=pa.string(), the resulting pyarrow Array can have inconsistent values. For most input types the result is an empty string, but for some types (int32, int64) the values are raw bytes such as '\x01'.
In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)

In [9]: pa.Array.from_pandas(s, type=pa.string())
Out[9]:
<pyarrow.lib.StringArray object at 0x7f90b6091a48>
[
  "",
  "",
  ""
]

In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)

In [11]: pa.Array.from_pandas(s, type=pa.string())
Out[11]:
<pyarrow.lib.StringArray object at 0x7f9097efca48>
[
  "",
  "",
  ""
]
This came up in the Spark discussion https://github.com/apache/spark/pull/24930/files#r296187903. Type casting this way is not supported in Spark, but it would be good to make the behavior consistent. Would it be better to raise an UnsupportedOperation error?