Apache Arrow / ARROW-17983

[Parquet][C++][Python] "List index overflow" when reading a Parquet file

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++, Parquet, Python
    • Labels: None

    Description

      From issue https://github.com/apache/arrow/issues/14229.

      The bug can be reproduced as follows:

      • create a pandas DataFrame with one column and n rows, where n < max(int32)
      • make each element a list of m integers, where m * n > max(int32)
      • save the DataFrame to a Parquet file
      • read the Parquet file back; the read fails with "OSError: List index overflow"
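
      The steps above can be sketched roughly as follows. This is a minimal, untested sketch and not the script from the linked comment: the helper names `total_values` and `roundtrip`, the column name "a", the file path, and the use of int8 values (chosen only to reduce memory) are all assumptions. Sizes large enough to cross max(int32) need several GB of RAM.

```python
INT32_MAX = 2**31 - 1  # max(int32)


def total_values(n_rows, list_len):
    """Number of child values stored across all lists in the column (m * n)."""
    return n_rows * list_len


def roundtrip(n_rows, list_len, path):
    """Write a one-column DataFrame of integer lists to Parquet and read it back."""
    # Imported here so the size arithmetic above runs without these packages.
    import numpy as np
    import pandas as pd
    import pyarrow.parquet as pq

    # One column, n_rows rows, each element a list of list_len integers.
    # int8 is an assumption to keep memory low; the report only says "integers".
    df = pd.DataFrame(
        {"a": [np.zeros(list_len, dtype=np.int8) for _ in range(n_rows)]}
    )
    df.to_parquet(path)         # save to a Parquet file
    return pq.read_table(path)  # the read step that the report says fails


# Hypothetical sizes matching the report: n < max(int32), m * n > max(int32).
n, m = 2**16, 2**16
print(n < INT32_MAX, total_values(n, m) > INT32_MAX)
```

      Calling `roundtrip(n, m, "overflow.parquet")` with these sizes is the part expected to hit the reported error; with small sizes such as `roundtrip(10, 5, "small.parquet")` the same code should simply round-trip.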

      See the comment below for details on how to reproduce this bug:
      https://github.com/apache/arrow/issues/14229#issuecomment-1272223773

      Based on testing with a small dataset, the error might come from the code below:
      https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
      OffsetType is int32, but the loop is executed (and *offset is incremented) m * n times, which goes beyond max(int32).
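
      The arithmetic behind this can be illustrated with a small Python model. It is not the Arrow C++ code: `list_offsets` is a hypothetical helper that accumulates list offsets per row and refuses to exceed the int32 range, which is the assumed cause of the error message.

```python
INT32_MAX = 2**31 - 1  # largest value an int32 offset can hold


def list_offsets(list_lengths):
    """Build cumulative list offsets, refusing to exceed the int32 range."""
    offsets = [0]
    for length in list_lengths:
        end = offsets[-1] + length
        if end > INT32_MAX:
            raise OverflowError("List index overflow")
        offsets.append(end)
    return offsets


# Small case: 3 lists of 4 values fit easily.
print(list_offsets([4, 4, 4]))  # [0, 4, 8, 12]

# n = 3 rows with m = 2**30 values each: n < max(int32) but m * n > max(int32).
try:
    list_offsets([2**30] * 3)
except OverflowError as exc:
    print(exc)  # List index overflow
```

      The row count alone never comes close to the limit here; it is the running total of child values that crosses max(int32).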

          People

            Assignee: Unassigned
            Reporter: Yibo Cai (yibocai)