Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 3.0.0
Description
Apologies if this behavior is deliberate, but the pyarrow Parquet writer for the list data type does not appear to conform to the Apache Parquet list logical type specification.
According to this page: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists, a list type contains 3 levels, where the middle level, named list, must be a repeated group with a single field named element.
However, in the Parquet file produced by the pyarrow writer, that single field is named item instead.
Below is example Python code that produces such a Parquet file (I use pandas version 1.2.1 and pyarrow version 3.0.0):
import pandas as pd

df = pd.DataFrame(data=[
    {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'}, {'name': 'star craft', 'version': '2'}]},
    {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')
Then I use parquet-tools from https://formulae.brew.sh/formula/parquet-tools to check the metadata of the Parquet file with this command:
parquet-tools meta /tmp/test.parquet
The full metadata is in the attachment; here is only the excerpt for the list-type column:
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
As can be seen, the single field under list is named item.
I think it should be named element to conform to the Apache Parquet specification.
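To make the mismatch easy to check mechanically, here is a minimal sketch that parses the parquet-tools excerpt above and tests it against the rule described in this report (middle level is a repeated group named list, containing a single field named element). The helper names `parse_schema` and `check_list_column` are hypothetical and written only for this illustration; they are not part of pyarrow or parquet-tools, and the check covers only the rule quoted here, not the full specification.

```python
import re

# Matches one line of the `parquet-tools meta` schema excerpt, e.g.
# "..item: OPTIONAL F:2" -> two leading dots, name "item", repetition "OPTIONAL".
LINE_RE = re.compile(r"^(?P<dots>\.*)(?P<name>[^.:\s][^:]*):\s+(?P<rep>\w+)")


def parse_schema(text):
    """Return (depth, name, repetition) tuples, one per recognized schema line."""
    fields = []
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if m:
            fields.append((len(m.group("dots")), m.group("name"), m.group("rep")))
    return fields


def check_list_column(fields):
    """Return a list of problems found in a 3-level list column (empty if none)."""
    if not fields:
        return ["no schema lines found"]
    problems = []
    outer_depth = fields[0][0]  # the first line is assumed to be the list column itself
    middle = [f for f in fields if f[0] == outer_depth + 1]
    inner = [f for f in fields if f[0] == outer_depth + 2]
    if len(middle) != 1 or middle[0][1] != "list" or middle[0][2] != "REPEATED":
        problems.append("middle level must be a single REPEATED group named 'list'")
    if len(inner) != 1:
        problems.append("the 'list' group must contain exactly one field")
    elif inner[0][1] != "element":
        problems.append(f"inner field is named {inner[0][1]!r}, expected 'element'")
    return problems


# The excerpt from this report, copied verbatim.
EXCERPT = """\
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
"""

print(check_list_column(parse_schema(EXCERPT)))
```

On the excerpt above this reports one problem, the inner field being named item; replacing item with element in the excerpt makes the check pass.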
Attachments
Issue Links
- relates to ARROW-14196 [C++][Parquet] Default to compliant nested types in Parquet writer (In Progress)
- links to