[ARROW-11344] [Python] Data of struct fields are out-of-order in parquet files created by the write_table() method - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: Python
Labels:
- good-first-issue
- needs-test

External issue URL:
https://github.com/apache/arrow/issues/27241

Description

Hi,

We recently found that data in 'struct' fields ends up out of order in the written Parquet file, and would like your help finding the root cause.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

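# Load the flat input (columns: file_package, file_name, fruit_name).
df = pd.read_csv('./test_struct.csv')
print(df.dtypes)
# Combine file_package and file_name into one dict per row; this becomes the struct value.
df['full_name'] = df.apply(lambda x: {"package": x['file_package'], "name": x["file_name"]}, axis=1)
my_df = df.drop(['file_package', 'file_name'], axis=1)

# Explicit schema: one struct column plus one plain string column.
file_fields = [('package', pa.string()), ('name', pa.string())]
my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
                       pa.field('fruit_name', pa.string())])
my_table = pa.Table.from_pandas(my_df, schema=my_schema)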
print('Table schema:')
print(my_table.schema)

pq.write_table(my_table, './test_struct_200.parquet')

The above code (attached as test_struct_200.py) was run with the following Python packages:

Pandas Version = 1.1.3
PyArrow Version = 2.0.0

I then used parquet-tools (1.11.1) to read the file, and got the following output:

$ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
...
full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

(BTW, you can also view the parquet file with http://parquet-viewer-online.com/)

The expected output, according to test_struct.csv, is:

$ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
...
full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

For comparison, the following code (attached as test_struct_200_flat.py) generates a Parquet file whose data matches test_struct.csv:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('./test_struct.csv')
print(df.dtypes)
my_schema = pa.schema([pa.field('file_package', pa.string()),
                       pa.field('file_name', pa.string()),
                       pa.field('fruit_name', pa.string())])
my_table = pa.Table.from_pandas(df, schema=my_schema)
print('Table schema:')
print(my_table.schema)

pq.write_table(my_table, './test_struct_200_flat.parquet')

I have also attached the two Parquet files for reference.

Attachments

- test_struct.csv, 22/Jan/21 06:53, 62 kB, Ming Chen
- test_struct_200.parquet, 22/Jan/21 06:53, 3 kB, Ming Chen
- test_struct_200.py, 22/Jan/21 06:53, 0.6 kB, Ming Chen
- test_struct_200_flat.py, 22/Jan/21 06:53, 0.5 kB, Ming Chen
- test_struct_200_flat.parquet, 22/Jan/21 06:53, 3 kB, Ming Chen

People

Assignee: Unassigned

Reporter: Ming Chen

Votes: 0

Watchers: 4

Dates

Created: 22/Jan/21 06:58

Updated: 11/Jan/23 08:19