Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: None
Description
Hi,
We recently found an out-of-order issue with the 'struct' data type and would like to know if you can help find the root cause.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.read_csv('./test_struct.csv')
    print(df.dtypes)

    df['full_name'] = df.apply(lambda x: {"package": x['file_package'], "name": x["file_name"]}, axis=1)
    my_df = df.drop(['file_package', 'file_name'], axis=1)

    file_fields = [('package', pa.string()), ('name', pa.string())]
    my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
                           pa.field('fruit_name', pa.string())])
    my_table = pa.Table.from_pandas(my_df, schema=my_schema)
    print('Table schema:')
    print(my_table.schema)

    pq.write_table(my_table, './test_struct_200.parquet')
The above code (attached as test_struct_200.py) was run with the following Python packages:
    Pandas Version = 1.1.3
    PyArrow Version = 2.0.0
Then I used parquet-tools (1.11.1) to read the file, but got the following output:
    $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
    ...
    full_name:
    .package = fruit.zip
    .name = apple.csv
    fruit_name = strawberry

    full_name:
    .package = fruit.zip
    .name = apple.csv
    fruit_name = strawberry

    full_name:
    .package = fruit.zip
    .name = apple.csv
    fruit_name = strawberry
(BTW, you can also view the parquet file with http://parquet-viewer-online.com/)
The output should be (see test_struct.csv):
    $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
    ...
    full_name:
    .package = fruit.zip
    .name = strawberry.csv
    fruit_name = strawberry

    full_name:
    .package = fruit.zip
    .name = strawberry.csv
    fruit_name = strawberry

    full_name:
    .package = fruit.zip
    .name = strawberry.csv
    fruit_name = strawberry
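Rather than comparing the two outputs by eye, the mismatch can be checked programmatically. The following is only a sketch: the helper name find_mismatched_rows and the sample rows are hypothetical, and it assumes that in test_struct.csv the file_name is always the fruit_name followed by '.csv', as the expected output above suggests.

```python
def find_mismatched_rows(rows):
    """Return (index, row) pairs whose struct 'name' does not match 'fruit_name'.

    Assumption: file_name is always fruit_name + '.csv', as in the expected
    output (strawberry.csv / strawberry). Adjust if the real CSV differs.
    """
    mismatches = []
    for index, row in enumerate(rows):
        expected_name = f"{row['fruit_name']}.csv"
        if row["full_name"]["name"] != expected_name:
            mismatches.append((index, row))
    return mismatches


# Hypothetical sample: one row as expected, one as parquet-tools reported it.
sample_rows = [
    {"full_name": {"package": "fruit.zip", "name": "strawberry.csv"},
     "fruit_name": "strawberry"},
    {"full_name": {"package": "fruit.zip", "name": "apple.csv"},
     "fruit_name": "strawberry"},
]

bad = find_mismatched_rows(sample_rows)
print([index for index, _ in bad])  # prints [1]
```

For the real file, the rows could presumably be obtained with pq.read_table('./test_struct_200.parquet').to_pandas().to_dict('records'). Because that path reads the file through pyarrow instead of parquet-tools, comparing its result with the parquet-tools output may also help narrow down whether the writer or a reader is at fault.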
For comparison, the following code (attached as test_struct_200_flat.py) generates a parquet file with the same data as test_struct.csv:
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.read_csv('./test_struct.csv')
    print(df.dtypes)

    my_schema = pa.schema([pa.field('file_package', pa.string()),
                           pa.field('file_name', pa.string()),
                           pa.field('fruit_name', pa.string())])
    my_table = pa.Table.from_pandas(df, schema=my_schema)
    print('Table schema:')
    print(my_table.schema)

    pq.write_table(my_table, './test_struct_200_flat.parquet')
I have also attached the two parquet files for your reference.
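The attachments are not reproduced in this text. For anyone trying the scripts without them, a stand-in CSV with the same three columns (file_package, file_name, fruit_name) could be generated roughly as follows. This is a hypothetical sketch: the function name write_stand_in_csv, the fruit list, and the row counts are assumptions, not the contents of the real test_struct.csv, and it is not confirmed that this data triggers the issue.

```python
import csv
import io


def write_stand_in_csv(handle, fruits=("apple", "strawberry"), rows_per_fruit=2000):
    """Write a stand-in CSV to an open text handle and return the data row count.

    Assumption: every row uses the package 'fruit.zip' and a file_name of
    fruit_name + '.csv', mirroring the values visible in the report.
    """
    writer = csv.writer(handle)
    writer.writerow(["file_package", "file_name", "fruit_name"])
    for fruit in fruits:
        for _ in range(rows_per_fruit):
            writer.writerow(["fruit.zip", f"{fruit}.csv", fruit])
    return len(fruits) * rows_per_fruit


# Small in-memory demonstration; for a real run, open a file instead, e.g.
# open('./test_struct.csv', 'w', newline='').
buffer = io.StringIO()
row_count = write_stand_in_csv(buffer, rows_per_fruit=2)
print(row_count)  # prints 4
```

Writing to a handle rather than a fixed path keeps the helper easy to try in memory before it overwrites anything on disk.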