  1. Apache Arrow
  2. ARROW-11607

[Python] Error when reading table with list values from parquet

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0, 1.0.1, 2.0.0, 3.0.0
    • Fix Version/s: 4.0.0
    • Component/s: C++, Python
    • Environment: Python 3.7

    Description

      I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.

      Example code (pyarrow 2.0.0 and 3.0.0):

      from pyarrow import parquet, Table
      
      data = [None] * (1 << 20)
      data.append([1])
      
      table = Table.from_arrays([data], ['column'])
      print('Expected: %s' % table['column'][-1])
      
      parquet.write_table(table, 'table.parquet')
      
      table2 = parquet.read_table('table.parquet')
      print('Actual:   %s' % table2['column'][-1])

      Output:

      Expected: [1]
      Actual:   [0]

      When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:

      Expected: [1]
      Actual:   [1]

      For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
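
      The threshold behaviour described above can be checked with a small helper that runs the same round trip for a given number of leading null rows. This is only a sketch: the function names null_counts_around and roundtrip_last_value are made up for illustration, the file is written to a temporary directory instead of the working directory, and pa.array is used to build the column explicitly.

```python
import os
import tempfile


def null_counts_around(exponent):
    """Return the leading-null counts just below and at 1 << exponent."""
    threshold = 1 << exponent
    return [threshold - 1, threshold]


def roundtrip_last_value(num_nulls):
    """Write num_nulls null rows followed by [1], read the file back,
    and return (expected, actual) for the last row as Python objects."""
    # Imported here so null_counts_around() can be used without pyarrow.
    import pyarrow as pa
    from pyarrow import parquet

    data = [None] * num_nulls
    data.append([1])
    table = pa.Table.from_arrays([pa.array(data)], ['column'])

    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, 'table.parquet')
        parquet.write_table(table, path)
        table2 = parquet.read_table(path)

    return table['column'][-1].as_py(), table2['column'][-1].as_py()
```

      Calling roundtrip_last_value(n) for each n in null_counts_around(20) (or null_counts_around(15) for pyarrow 1.0.0 and 1.0.1) compares the two cases side by side; on an affected version the pair returned for the larger count would be expected to differ.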

      This seems to be caused by an overflow and memory corruption, because in pyarrow 3.0.0 with more complex values (a list of dictionaries containing float and datetime values, with datetime imported from the datetime module):

      data.append([{'a': 0.1, 'b': datetime.now()}])

      I get this exception after calling table2.to_pandas():

      /arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool

       

            People

              Assignee: Micah Kornfield (emkornfield)
              Reporter: Michal Glaus (misogl)
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1.5h
