[ARROW-11069] [C++] Parquet writer incorrect data being written when data type is struct

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.0.0
Fix Version/s: 3.0.0
Component/s: Python
Labels:
None
Environment:
pandas v1.0.4

Flags:

Important
External issue URL:
https://github.com/apache/arrow/issues/18438

Description

Incorrect data is written to Parquet when a DataFrame containing a dict (struct) column is written using pyarrow. To reproduce:

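import pandas as pd

# Read the attached file, then write it straight back out unchanged.
orig = pd.read_parquet("original.parquet")
orig.to_parquet("first_write.parquet")

# Read the rewritten file back in.
first_write = pd.read_parquet("first_write.parquet")

# A lossless round trip should print True.
print(orig.equals(first_write))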

The incorrect results start appearing after index 1024. first_write.parquet was created by reading original.parquet and writing it back out. I don't see any obvious pattern in the shuffled rows.
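To pin down where the round trip starts to diverge, the rows can be compared one by one. The following is a minimal sketch, not part of the original report: the helper names `first_mismatch` and `mismatched_indices` are hypothetical, and the demo data is synthetic (two rows swapped by hand) rather than taken from the attached files.

```python
from typing import Any, List, Optional, Sequence


def first_mismatch(original: Sequence[Any], written: Sequence[Any]) -> Optional[int]:
    """Return the index of the first differing row, or None if all rows match.

    Note: rows containing float NaN compare unequal to themselves and
    would be reported as mismatches by this simple check.
    """
    for i, (a, b) in enumerate(zip(original, written)):
        if a != b:
            return i
    # All shared rows match; a length difference is still a mismatch.
    if len(original) != len(written):
        return min(len(original), len(written))
    return None


def mismatched_indices(original: Sequence[Any], written: Sequence[Any]) -> List[int]:
    """Return every index (within the shared length) where the rows differ."""
    return [i for i, (a, b) in enumerate(zip(original, written)) if a != b]


# Synthetic stand-in for a struct column: one dict per row (hypothetical data).
demo_original = [{"s": {"a": i, "b": str(i)}} for i in range(2048)]
demo_written = list(demo_original)
# Simulate shuffled rows past index 1024 by swapping two of them.
demo_written[1025], demo_written[1030] = demo_written[1030], demo_written[1025]

print(first_mismatch(demo_original, demo_written))
```

With the frames from the snippet above, the same helpers could be applied to `orig.to_dict("records")` and `first_write.to_dict("records")`.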

Original records

Written records

Attachments

first_write.parquet                29/Dec/20 19:40  5 kB    Palash Goel
image-2020-12-30-01-19-20-491.png  29/Dec/20 19:49  109 kB  Palash Goel
image-2020-12-30-01-19-42-739.png  29/Dec/20 19:49  109 kB  Palash Goel
image-2020-12-30-01-20-45-183.png  29/Dec/20 19:50  130 kB  Palash Goel
original.parquet                   29/Dec/20 19:40  5 kB    Palash Goel

Issue Links

is duplicated by

ARROW-10493 [C++][Parquet] Writing nullable nested strings results in wrong data in file

Resolved

People

Assignee: Unassigned

Reporter: Palash Goel

Votes: 0

Watchers: 3

Dates

Created: 29/Dec/20 19:43

Updated: 11/Jan/23 08:17

Resolved: 04/Jan/21 12:20