Apache Arrow / ARROW-11024

[C++][Parquet] Writing List<Struct> to parquet sometimes writes wrong data

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Python
    • Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0

    Description

       When writing tables that contain List<Struct> columns to Parquet, the data is sometimes written incorrectly. The code sample below reproduces the problem. No exception is raised, but an equality check via equals() returns False for the second test case.

       

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Write a small amount of data to a Parquet file and read it back. In this case, both tables are equal.
      data1 = [[{'x': 'abc', 'y': 'abc'}]] * 100 + [[{'x': 'abc', 'y': 'gcb'}]] * 100
      array1 = pa.array(data1)
      table1 = pa.table([array1], names=['column'])
      pq.write_table(table1, 'temp1.parquet')
      table1_1 = pq.read_table('temp1.parquet')
      print(table1_1.equals(table1))  # True

      # Write a larger amount of data to a Parquet file and read it back. In this case, the tables are not equal.
      data2 = data1 * 100
      array2 = pa.array(data2)
      table2 = pa.table([array2], names=['column'])
      pq.write_table(table2, 'temp2.parquet')
      table2_1 = pq.read_table('temp2.parquet')
      print(table2_1.equals(table2))  # False on pyarrow 2.0.0
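Since equals() only reports that the tables differ, it may help to see which rows were affected. The following is a minimal sketch and not part of the original report: the helper name `first_mismatches` and the stand-in data are hypothetical, and it compares plain Python lists so it runs without pyarrow.

```python
def first_mismatches(expected, actual, limit=5):
    """Return up to `limit` (index, expected_row, actual_row) tuples for rows that differ.

    Hypothetical helper for illustration; it only compares positions present
    in both sequences (zip stops at the shorter one).
    """
    mismatches = []
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            mismatches.append((i, e, a))
            if len(mismatches) >= limit:
                break
    return mismatches


# Stand-in data shaped like the List<Struct> rows in the report.
expected = [[{'x': 'abc', 'y': 'abc'}]] * 3 + [[{'x': 'abc', 'y': 'gcb'}]] * 3
actual = list(expected)
actual[4] = [{'x': 'abc', 'y': 'abc'}]  # simulate one row read back incorrectly

for index, want, got in first_mismatches(expected, actual):
    print(index, want, got)
```

With the tables from the report, the inputs would presumably be `table2.column('column').to_pylist()` and `table2_1.column('column').to_pylist()`.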
      
      

       

       

            People

              Assignee: Joris Van den Bossche (jorisvandenbossche)
              Reporter: George Deamont (georgedeamont)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m