[ARROW-5655] [Python] Table.from_pydict/from_arrays not using types in specified schema correctly - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.16.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/22090

Description

Example with from_pydict (from https://github.com/apache/arrow/pull/4601#issuecomment-503676534):

In [15]: table = pa.Table.from_pydict(
    ...:     {'a': [1, 2, 3], 'b': [3, 4, 5]},
    ...:     schema=pa.schema([('a', pa.int64()), ('c', pa.int32())]))

In [16]: table
Out[16]: 
pyarrow.Table
a: int64
c: int32

In [17]: table.to_pandas()
Out[17]: 
   a  c
0  1  3
1  2  0
2  3  4

Note that the specified schema 1) uses different column names than the dictionary ('c' instead of 'b') and 2) specifies a non-default type (int32 instead of the inferred int64), which leads to corrupted values.

This is partly because Table.from_pydict does not use the type information in the schema when converting the dictionary items to pyarrow arrays. In addition, Table.from_arrays does not correctly cast the arrays to another type when the schema specifies one.

An additional question for Table.from_pydict is whether it should actually map the 'b' key from the dictionary onto column 'c' as defined in the schema (this behaviour depends on the order of the dictionary, which is only an implementation detail in CPython 3.6 and not guaranteed by the language before Python 3.7).

Issue Links

links to

GitHub Pull Request #5567

People

Assignee: Krisztian Szucs

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 19/Jun/19 18:34

Updated: 11/Jan/23 07:41

Resolved: 08/Oct/19 13:33
