[ARROW-11497] [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 4.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/27376

Description

Apologies if this behavior is deliberate, but the pyarrow Parquet writer does not appear to conform to the Apache Parquet specification for the list logical type.

According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists, a list is encoded as a 3-level structure in which the middle level, named list, must be a repeated group with a single field named element.

However, in the Parquet file produced by the pyarrow writer, that single field is named item instead.

Below is example Python code that produces such a Parquet file (I use pandas version 1.2.1 and pyarrow version 3.0.0):

import pandas as pd

df = pd.DataFrame(data=[
    {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'}, {'name': 'star craft', 'version': '2'}]},
    {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')

Then I used parquet-tools (https://formulae.brew.sh/formula/parquet-tools) to inspect the metadata of the Parquet file with this command:

parquet-tools meta /tmp/test.parquet

The full metadata is in the attached file; here is only the extract for the list-type column:

games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4

As can be seen, the single field under list is named item.
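The naming rule can also be checked mechanically. The sketch below is only an illustration: `noncompliant_list_paths` is a hypothetical helper, and the dotted column paths are my own rendering of the nesting shown in the parquet-tools extract. It assumes that any group named list is the middle level of a list type, which would misfire on a struct field that happens to be called list.

```python
def noncompliant_list_paths(paths):
    """Return the paths in which the child of a `list` group is not `element`."""
    bad = []
    for path in paths:
        parts = path.split(".")
        # Walk over (group, child) pairs of adjacent path segments.
        for group, child in zip(parts, parts[1:]):
            if group == "list" and child != "element":
                bad.append(path)
                break
    return bad


# Column paths reconstructed from the parquet-tools extract above.
observed = ["games.list.item.name", "games.list.item.version"]
print(noncompliant_list_paths(observed))
# → ['games.list.item.name', 'games.list.item.version']
```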

I think this field should be named element to conform to the Apache Parquet specification.

Attachments

parquet-tools-meta.log
04/Feb/21 13:40
3 kB
Truc Lam Nguyen

Issue Links

relates to

ARROW-14196 [C++][Parquet] Default to compliant nested types in Parquet writer

In Progress

links to

GitHub Pull Request #9489

People

Assignee: Truc Lam Nguyen

Reporter: Truc Lam Nguyen

Votes: 0

Watchers: 5

Dates

Created: 04/Feb/21 13:53

Updated: 11/Jan/23 08:20

Resolved: 24/Mar/21 04:09

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2h 40m