ARROW-11142: [C++][Parquet] Inconsistent batch_size usage in parquet GetRecordBatchReader


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: C++

    Description

      The RecordBatchReader returned from parquet::arrow::FileReader::GetRecordBatchReader, originally introduced in ARROW-1012 and now exposed in Python (ARROW-7800), shows inconsistent behaviour in how batch_size is followed across Parquet file row groups.

      See also comments at https://github.com/apache/arrow/pull/6979#issuecomment-754672429

      A small example with a Parquet file of 300 rows, consisting of 3 row groups of 100 rows each:

      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.table({'a': range(300)})
      pq.write_table(table, "test.parquet", row_group_size=100)
      f = pq.ParquetFile("test.parquet")
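
      The row group layout can be confirmed from the file metadata (a small sketch using the standard pyarrow.parquet metadata accessors):

      # 3 row groups of 100 rows each
      f.metadata.num_row_groups
      # -> 3
      [f.metadata.row_group(i).num_rows for i in range(f.metadata.num_row_groups)]
      # -> [100, 100, 100]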
      

      When reading this file with a batch_size that doesn't align with the row group size, batches that cross the row group boundaries are returned by default:

      In [5]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
      Out[5]: [80, 80, 80, 60]
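
      This default pattern matches a plain fixed-size chunking of the total row count that ignores row groups entirely. As a minimal model of the observed output (an illustration only, not the actual C++ implementation; the helper name is made up):

      def chunk_ignoring_row_groups(total_rows, batch_size):
          # one batch per batch_size rows; the last batch gets the remainder
          return [min(batch_size, total_rows - start)
                  for start in range(0, total_rows, batch_size)]

      chunk_ignoring_row_groups(300, 80)
      # -> [80, 80, 80, 60]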
      

      However, when the file contains a dictionary-typed column with string values (integer dictionary values don't trigger it), the batches follow the row group boundaries:

      import pandas as pd

      table = pa.table({'a': pd.Categorical([str(x) for x in range(300)])})
      pq.write_table(table, "test.parquet", row_group_size=100)
      f = pq.ParquetFile("test.parquet")
      
      In [13]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
      Out[13]: [80, 20, 60, 40, 40, 60]
      

      Note that the batch_size count does not restart at the beginning of each row group: the reader only splits a batch that is in progress at a boundary and then keeps counting (e.g. the 20-row batch and the following 60-row batch together form one logical batch of 80 rows).
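
      Concretely, the observed output matches the following model, in which the batch_size counter keeps running across row groups but an in-progress batch is cut at every row group boundary (again just an illustration of the behaviour, not the actual implementation):

      def split_at_boundaries(row_group_sizes, batch_size):
          batches = []
          remaining = batch_size  # rows still missing from the current logical batch
          for rows in row_group_sizes:
              while rows > 0:
                  n = min(remaining, rows)  # cut short at the row group boundary
                  batches.append(n)
                  rows -= n
                  remaining -= n
                  if remaining == 0:
                      remaining = batch_size  # logical batch complete; start a new one
          return batches

      split_at_boundaries([100, 100, 100], 80)
      # -> [80, 20, 60, 40, 40, 60]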

      Additionally, when reading empty batches (an empty column selection), the row group boundaries are also followed, but differently: the batching is done independently within each row group:

      In [14]: [batch.num_rows for batch in f.iter_batches(batch_size=80, columns=[])]
      Out[14]: [80, 20, 80, 20, 80, 20]
      

      (this is explicitly coded here: https://github.com/apache/arrow/blob/e05f032c1e5d590ac56372d13ec637bd28b47a96/cpp/src/parquet/arrow/reader.cc#L899-L921)
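
      That third pattern corresponds to restarting the batch_size count in every row group, as in this sketch (for comparison with the model above; the helper name is again made up):

      def split_per_row_group(row_group_sizes, batch_size):
          # batching restarts independently in each row group
          batches = []
          for rows in row_group_sizes:
              while rows > 0:
                  n = min(batch_size, rows)
                  batches.append(n)
                  rows -= n
          return batches

      split_per_row_group([100, 100, 100], 80)
      # -> [80, 20, 80, 20, 80, 20]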

      I don't know what the expected behaviour should be, but I would at least expect it to be consistent.

          People

            Assignee: Unassigned
            Reporter: Joris Van den Bossche