[ARROW-1459] [Python] PyArrow fails to load partitioned parquet files with non-primitive types - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6.0
Fix Version/s: 0.7.0
Component/s: Python
Labels: None
External issue URL: https://github.com/apache/arrow/issues/17479

Description

When reading partitioned Parquet files that contain lists (tested with files produced by Spark), the resulting table seems to contain list data loaded from only one partition: in the output below, the list values of the first partition are repeated for the second. Primitive types seem to be loaded correctly.

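It can be reproduced using the following code (arrow 0.6.0, spark 2.1.1):

>>> # `spark` is the SparkSession that the pyspark shell provides
>>> import numpy as np
>>> import pyarrow.parquet as pq
>>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(), np.arange(20).reshape((10,2)).tolist())))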
>>> df.toPandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]
>>> df.repartition(2).write.parquet('df_parts.parquet')
>>> pq.read_table('df_parts.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   2    [4, 5]
2   4    [8, 9]
3   6  [12, 13]
4   8  [16, 17]
5   1    [0, 1]
6   3    [4, 5]
7   5    [8, 9]
8   7  [12, 13]
9   9  [16, 17]

When the data is loaded using Spark or coalesced into one partition, everything works as expected:

>>> spark.read.parquet('df_parts.parquet').toPandas()
   _1        _2
0   1    [2, 3]
1   3    [6, 7]
2   5  [10, 11]
3   7  [14, 15]
4   9  [18, 19]
5   0    [0, 1]
6   2    [4, 5]
7   4    [8, 9]
8   6  [12, 13]
9   8  [16, 17]
>>> df.coalesce(1).write.parquet('df_single.parquet')
>>> pq.read_table('df_single.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]

People

Assignee: Wes McKinney

Reporter: Jonas Amrich

Votes: 0

Watchers: 4

Dates

Created: 04/Sep/17 14:51

Updated: 11/Jan/23 07:14

Resolved: 12/Sep/17 07:13