  1. Apache Arrow
  2. ARROW-11497

[Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

Details

    Description

      Sorry if this behaviour is deliberate, but the Parquet writer for the list data type does not appear to conform to the Apache Parquet specification for the list logical type.

      According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists, a list type has three levels, and the middle level, named list, must be a repeated group with a single field named element.

      However, in a Parquet file written by pyarrow, that single field is named item instead.

      Below is example Python code that produces such a Parquet file (I am using pandas 1.2.1 and pyarrow 3.0.0):

      import pandas as pd
       
      df = pd.DataFrame(data=[ {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'}, {'name': 'star craft', 'version': '2'}]}, {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]}, ])
      df.to_parquet('/tmp/test.parquet', engine='pyarrow')
      

      I then used parquet-tools (from https://formulae.brew.sh/formula/parquet-tools) to inspect the metadata of the Parquet file with this command:

      parquet-tools meta /tmp/test.parquet

      The full metadata is in the attached file; here is only the excerpt for the list-type column:

      games: OPTIONAL F:1
      .list: REPEATED F:1
      ..item: OPTIONAL F:2
      ...name: OPTIONAL BINARY L:STRING R:1 D:4
      ...version: OPTIONAL BINARY L:STRING R:1 D:4

      As can be seen, the single field under list is named item.
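The same check can be done mechanically. The following is only a minimal sketch using the standard library: it parses the parquet-tools excerpt quoted in this report and lists any field that sits directly under a repeated list group without being named element. The helper names (`parse`, `noncompliant_lists`) are hypothetical, and the parser assumes the dot-indented layout shown above rather than every output format parquet-tools may produce.

```python
import re

# Excerpt copied from the parquet-tools output quoted in this report.
META = """\
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
"""

# Leading dots give the nesting depth; then "<name>: <REPETITION> ...".
LINE = re.compile(r"^(?P<dots>\.*)(?P<name>[^.:\s]+):\s+(?P<repetition>\w+)")


def parse(meta):
    """Return (depth, name, repetition) for every schema line (hypothetical helper)."""
    rows = []
    for line in meta.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append((len(m["dots"]), m["name"], m["repetition"]))
    return rows


def noncompliant_lists(meta):
    """Return child names under a repeated `list` group that are not `element`."""
    rows = parse(meta)
    bad = []
    # Compare each line with the one that follows it: the first child of the
    # repeated `list` group is the single field the specification calls `element`.
    for (depth, name, rep), (child_depth, child_name, _) in zip(rows, rows[1:]):
        if name == "list" and rep == "REPEATED" and child_depth == depth + 1:
            if child_name != "element":
                bad.append(child_name)
    return bad


print(noncompliant_lists(META))  # prints ['item']
```

An empty result would mean every list group in the excerpt uses the name the specification requires.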

      I think this field should be named element to conform to the Apache Parquet specification.

      Attachments

        1. parquet-tools-meta.log
          3 kB
          Truc Lam Nguyen

            People

              trucnguyenlam Truc Lam Nguyen
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 40m