Apache Arrow / ARROW-9790

[Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: Rust

    Description

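      I was reading a Parquet file into RecordBatches using ParquetFileArrowReader. The file had row groups of 100,000 rows each, and I used a batch size of 60,000. After 300,000 rows had been read successfully, I started seeing the error below. Note that 300,000 is a multiple of both the batch size (5 x 60,000) and the row group size (3 x 100,000), so the fifth batch ended exactly on a row group boundary.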

       ParquetError("Parquet error: Not all children array length are the same!")
      

      Upon investigation, I found that when ParquetFileArrowReader reads a Parquet file with multiple row groups, and a batch happens to end exactly at the end of a row group, no subsequent row groups are read for Int or Float columns.

      Visually:

      +-----+
      | RG1 |
      |     |
      +-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
      +-----+
      | RG2 |
      |     |
      +-----+
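
The condition in the diagram can be checked with simple arithmetic. The following is a minimal sketch, not code from the parquet crate: the function name `first_boundary_collision` is hypothetical, and it assumes batches are filled contiguously from row 0 across row groups. It returns the first row offset at which a batch end coincides with the end of a row group other than the last one.

```rust
/// Hypothetical helper (not part of the parquet crate).
/// Returns the row offset of the first batch end that coincides with the end
/// of a row group other than the last one, or `None` if no batch ends there.
/// Assumes batches are filled contiguously starting at row 0.
fn first_boundary_collision(row_group_sizes: &[usize], batch_size: usize) -> Option<usize> {
    if batch_size == 0 || row_group_sizes.is_empty() {
        return None;
    }
    let mut end = 0usize;
    // Walk the cumulative end offsets of every row group except the last;
    // ending on the final row group is harmless because nothing follows it.
    for &size in &row_group_sizes[..row_group_sizes.len() - 1] {
        end += size;
        // A full batch ends at `end` exactly when `end` is a positive
        // multiple of the batch size.
        if end > 0 && end % batch_size == 0 {
            return Some(end);
        }
    }
    None
}
```

For example, assuming a file with four row groups of 100,000 rows (the actual number of row groups is not stated in this report), `first_boundary_collision(&[100_000; 4], 60_000)` evaluates to `Some(300_000)`, which matches the point at which the error appeared.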
      

      A reproducer is attached. The ParquetFileArrowReader should read 20 values regardless of the batch size. However, with batch sizes such as 5 or 3 (for which some batch ends exactly on a boundary between row groups), not all the rows are read.

      To run the reproducer, decompress the attachment parquet_file_arrow_reader.zip and do `cargo run`

      The output is as follows:

      wrote 20 rows in 4 row groups to /tmp/repro.parquet
      Size when reading with batch_size 100 : 20
      Size when reading with batch_size 7 : 20
      Size when reading with batch_size 5 : 5
      

      The expected output is as follows (should always read 20 rows, regardless of the batch size):

      wrote 20 rows in 4 row groups to /tmp/repro.parquet
      Size when reading with batch_size 100 : 20
      Size when reading with batch_size 7 : 20
      Size when reading with batch_size 5 : 20
      

      Workaround

      Use a different batch size, one for which no batch ends exactly on a row group boundary.
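
One way to apply this workaround is to search downward from the desired batch size for one that never ends on an interior row group boundary. This is only a sketch under stated assumptions: the names `is_safe_batch_size` and `largest_safe_batch_size` are hypothetical, the row group sizes are assumed to be known in advance (for example from the file metadata), and batches are assumed to be filled contiguously from row 0.

```rust
/// Hypothetical helper: true if no full batch ends exactly at the end of a
/// row group other than the last one.
fn is_safe_batch_size(row_group_sizes: &[usize], batch_size: usize) -> bool {
    if batch_size == 0 {
        return false; // a zero batch size is never usable
    }
    let interior = row_group_sizes.len().saturating_sub(1);
    let mut end = 0usize;
    for &size in &row_group_sizes[..interior] {
        end += size;
        if end > 0 && end % batch_size == 0 {
            return false; // a batch would end exactly on this boundary
        }
    }
    true
}

/// Hypothetical helper: the largest batch size not exceeding `desired` that
/// avoids every interior row group boundary, or `None` if there is none.
fn largest_safe_batch_size(row_group_sizes: &[usize], desired: usize) -> Option<usize> {
    (1..=desired)
        .rev()
        .find(|&candidate| is_safe_batch_size(row_group_sizes, candidate))
}
```

With the reproducer's layout of four row groups of 5 rows each, `largest_safe_batch_size(&[5; 4], 5)` evaluates to `Some(4)`. Any batch size larger than the total row count is also safe under this model, which is consistent with the batch_size 100 run above reading all 20 rows.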

      Attachments

        1. parquet_file_arrow_reader.zip
          39 kB
          Andrew Lamb

          People

            Assignee: Andrew Lamb (alamb)
            Reporter: Andrew Lamb (alamb)
            Votes: 0
            Watchers: 3

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 3h 20m