  1. Apache Arrow
  2. ARROW-11344

[Python] Data of struct fields is out-of-order in parquet files created by the write_table() method

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Python

    Description

      Hi,

      We recently found an out-of-order issue with the 'struct' data type and would like to know if you can help find the root cause.

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      df = pd.read_csv('./test_struct.csv')
      print(df.dtypes)
      df['full_name'] = df.apply(lambda x: {"package": x['file_package'], "name": x["file_name"]}, axis=1)
      my_df = df.drop(['file_package', 'file_name'], axis=1)
      
      file_fields = [('package', pa.string()), ('name', pa.string()),]
      my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
                             pa.field('fruit_name', pa.string())])
      my_table = pa.Table.from_pandas(my_df, schema = my_schema)
      print('Table schema:')
      print(my_table.schema)
      
      pq.write_table(my_table, './test_struct_200.parquet')
      

      The above code (attached as test_struct_200.py) was run with the following Python packages:

      Pandas Version = 1.1.3
      PyArrow Version = 2.0.0
      

      I then used parquet-tools (1.11.1) to read the file and got the following output, in which the struct field .name does not match fruit_name:

      $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
      ...
      full_name:
      .package = fruit.zip
      .name = apple.csv
      fruit_name = strawberry
      
      full_name:
      .package = fruit.zip
      .name = apple.csv
      fruit_name = strawberry
      
      full_name:
      .package = fruit.zip
      .name = apple.csv
      fruit_name = strawberry
      

      (BTW, you can also view the parquet file with http://parquet-viewer-online.com/)

      The expected output (see test_struct.csv) is:

      $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
      ...
      full_name:
      .package = fruit.zip
      .name = strawberry.csv
      fruit_name = strawberry
      
      full_name:
      .package = fruit.zip
      .name = strawberry.csv
      fruit_name = strawberry
      
      full_name:
      .package = fruit.zip
      .name = strawberry.csv
      fruit_name = strawberry
      

      For comparison, the following code (attached as test_struct_200_flat.py), which writes the same columns without a struct, generates a parquet file whose data matches test_struct.csv:

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      df = pd.read_csv('./test_struct.csv')
      print(df.dtypes)
      my_schema = pa.schema([pa.field('file_package', pa.string()),
                             pa.field('file_name', pa.string()),
                             pa.field('fruit_name', pa.string())])
      my_table = pa.Table.from_pandas(df, schema = my_schema)
      print('Table schema:')
      print(my_table.schema)
      
      pq.write_table(my_table, './test_struct_200_flat.parquet')
      

      I have also attached the two parquet files for your reference.
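
The mismatch between the struct field and fruit_name described above can be checked programmatically. The following is only a minimal sketch: the helper name find_mismatched_rows is hypothetical, and the rule that full_name["name"] should equal fruit_name plus ".csv" is an assumption inferred from the sample output, not something confirmed against the whole of test_struct.csv.

```python
from typing import Any, Dict, List


def find_mismatched_rows(records: List[Dict[str, Any]], suffix: str = ".csv") -> List[int]:
    """Return indices of rows whose struct field disagrees with fruit_name.

    Assumption (inferred from the sample output in this report):
    full_name["name"] should equal fruit_name + suffix.
    """
    bad = []
    for i, row in enumerate(records):
        full_name = row.get("full_name") or {}
        expected = f"{row.get('fruit_name')}{suffix}"
        if full_name.get("name") != expected:
            bad.append(i)
    return bad


# Hand-written sample rows in the same shape as the parquet-tools output.
sample = [
    {"full_name": {"package": "fruit.zip", "name": "apple.csv"}, "fruit_name": "apple"},
    {"full_name": {"package": "fruit.zip", "name": "apple.csv"}, "fruit_name": "strawberry"},
    {"full_name": {"package": "fruit.zip", "name": "strawberry.csv"}, "fruit_name": "strawberry"},
]
print(find_mismatched_rows(sample))  # prints [1]
```

To run it against the attached file, the records could be obtained with something like pq.read_table('./test_struct_200.parquet').to_pandas().to_dict('records'), assuming the struct column comes back as Python dicts. A non-empty result for test_struct_200.parquet together with an empty one for data rebuilt from the flat file would point at the struct write path.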

      Attachments

        1. test_struct.csv
          62 kB
          Ming Chen
        2. test_struct_200.py
          0.6 kB
          Ming Chen
        3. test_struct_200.parquet
          3 kB
          Ming Chen
        4. test_struct_200_flat.py
          0.5 kB
          Ming Chen
        5. test_struct_200_flat.parquet
          3 kB
          Ming Chen

          People

            Assignee: Unassigned
            Reporter: Ming Chen (acan)