Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- None
Description
When reading a parquet file into RecordBatches using ParquetFileArrowReader, with row groups 100,000 rows long and a batch size of 60,000, after reading 300,000 rows successfully I started seeing this error:
`ParquetError("Parquet error: Not all children array length are the same!")`
Upon investigation, I found that when reading with ParquetFileArrowReader, if the parquet input file has multiple row groups and a batch happens to end exactly at the end of a row group (for Int or Float columns), no subsequent row groups are read.
Visually:
+-----+
| RG1 |
|     |
+-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
A reproducer is attached. The ParquetFileArrowReader should read 20 values regardless of the batch size. However, with batch sizes such as 5 or 3 (where a batch ends on a row group boundary), not all the rows are read.
To run the reproducer, decompress the attached parquet_file_arrow_reader.zip and run `cargo run`.
The output is as follows:
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 5
The expected output is as follows (20 rows should always be read, regardless of the batch size):
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 20
Workaround
Use a batch size such that no batch ends exactly on a row group boundary.
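To see when the workaround applies, the trigger condition can be checked with plain arithmetic. This is a minimal sketch (not from the reproducer) that assumes fixed-size row groups, as in the attached example's 4 groups of 5 rows, and flags batch sizes where some batch ends exactly at a row-group boundary before the last row:

```rust
/// Returns true if, reading `total_rows` rows stored in row groups of
/// `rows_per_group` rows each with the given `batch_size`, some batch
/// ends exactly at a row-group boundary (the condition that triggers
/// this bug).
fn hits_row_group_boundary(total_rows: usize, rows_per_group: usize, batch_size: usize) -> bool {
    let mut read = 0;
    while read < total_rows {
        // each batch reads up to batch_size rows
        read += batch_size.min(total_rows - read);
        // a batch ending on a group boundary before the final row is the bad case
        if read < total_rows && read % rows_per_group == 0 {
            return true;
        }
    }
    false
}

fn main() {
    // 20 rows in 4 row groups of 5, as in the reproducer
    assert!(hits_row_group_boundary(20, 5, 5)); // first batch ends at row 5
    assert!(hits_row_group_boundary(20, 5, 3)); // fifth batch ends at row 15
    assert!(!hits_row_group_boundary(20, 5, 7)); // batches end at 7, 14, 20
    assert!(!hits_row_group_boundary(20, 5, 100)); // single batch reads everything
    println!("ok");
}
```

With the reproducer's layout this confirms that batch sizes 7 or 100 read all 20 rows while 5 and 3 do not; the safe choices are exactly those where no intermediate batch lands on a multiple of the row group length.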