[ARROW-1357] [Python] Data corruption in reading multi-file parquet dataset - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.5.0, 0.6.0
Fix Version/s: 0.7.0
Component/s: Python
Labels:
None
Environment:
python 3.5.3

External issue URL:
https://github.com/apache/arrow/issues/17388

Description

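I generated a Parquet dataset in Spark that consists of two files. When I read both files at once by passing the directory to pq.read_table, PyArrow returns corrupted data for rows that come from the second file. Reading the second file on its own gives different values for the same row.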

$ ls -l data
total 28608
-rw-rw-r-- 1 jarno jarno 14651449 Aug 15 09:30 part-00000-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet
-rw-rw-r-- 1 jarno jarno 14636502 Aug 15 09:30 part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet

import pyarrow.parquet as pq

tab1 = pq.read_table('data')
df1 = tab1.to_pandas()
df1[df1.account_id == 38658373328].legal_labels.tolist()

[array([ 2, 3, 5, 8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 31, 60, 61, 63,
64, 65, 66, 69, 70, 74, 75, 77, 82, 0, 1, 2, 3, 5, 8, 10, 11,
13, 14, 17, 18, 19, 21, 22])]

tab2 = pq.read_table('data/part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet')
df2 = tab2.to_pandas()
df2[df2.account_id == 38658373328].legal_labels.tolist()

[array([ 0, 1, 2, 3, 5, 8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 24, 28,
30, 31, 36, 38, 39, 40, 41, 43, 49, 60, 61, 62, 63, 64, 65, 66, 67,
69, 70, 74, 75, 77, 82, 90])]

Unfortunately I cannot share the data files, and I was not able to create a dummy data file pair that would have triggered the bug. I'm sending this bug report in the hope that it is still useful without a minimal repro example.
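
Since the data files cannot be shared, a check along the following lines might help others test their own datasets for the same symptom. It is only a sketch: the helper names first_mismatch and check_dataset are hypothetical, and it assumes that the directory read returns rows in sorted file-name order and that the column fits in memory as Python lists.

```python
from pathlib import Path


def first_mismatch(combined, parts):
    """Compare values from a directory read against per-file reads.

    `combined` holds the column values from reading the whole directory;
    `parts` holds one list of values per file, in file order.
    Returns None if they agree, otherwise a tuple describing the first
    difference.
    """
    expected = [value for part in parts for value in part]
    if len(combined) != len(expected):
        return ("length", len(expected), len(combined))
    for index, (want, got) in enumerate(zip(expected, combined)):
        if want != got:
            return (index, want, got)
    return None


def check_dataset(directory, column):
    """Read `column` both ways and report the first differing row, if any."""
    # Imported here so the comparison helper above has no pyarrow dependency.
    import pyarrow.parquet as pq

    combined = pq.read_table(directory, columns=[column]).column(column).to_pylist()
    # Assumption: the directory read concatenates files in sorted name order.
    files = sorted(Path(directory).glob("*.parquet"))
    parts = [
        pq.read_table(str(path), columns=[column]).column(column).to_pylist()
        for path in files
    ]
    return first_mismatch(combined, parts)
```

For the dataset described above, this would be called as check_dataset('data', 'legal_labels'). A result other than None means the directory read and the per-file reads disagree.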

People

Assignee: Wes McKinney

Reporter: Jarno Seppanen

Votes: 0

Watchers: 3

Dates

Created: 16/Aug/17 09:01

Updated: 11/Jan/23 07:14

Resolved: 20/Aug/17 17:49