ARROW-11142: [C++][Parquet] Inconsistent batch_size usage in parquet GetRecordBatchReader


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: C++

    Description

      The RecordBatchReader returned from parquet::arrow::FileReader::GetRecordBatchReader, originally introduced in ARROW-1012 and now exposed in Python (ARROW-7800), shows inconsistent behaviour in how batch_size is followed across Parquet file row groups.

      See also comments at https://github.com/apache/arrow/pull/6979#issuecomment-754672429

      A small example with a Parquet file of 300 rows, consisting of 3 row groups of 100 rows each:

      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.table({'a': range(300)})
      pq.write_table(table, "test.parquet", row_group_size=100)
      f = pq.ParquetFile("test.parquet")
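
      The row group layout can be confirmed from the file metadata (a small sketch using the standard pyarrow.parquet metadata accessors):

      # 3 row groups of 100 rows each
      f.metadata.num_row_groups
      # -> 3
      [f.metadata.row_group(i).num_rows for i in range(f.metadata.num_row_groups)]
      # -> [100, 100, 100]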
      

      When reading this file with a batch_size that doesn't align with the row group size, batches that cross the row group boundaries are returned by default:

      In [5]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
      Out[5]: [80, 80, 80, 60]
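
      This default pattern matches a plain fixed-size chunking of the total row count that ignores row groups entirely. As a minimal model of the observed output (an illustration only, not the actual C++ implementation; the helper name is made up):

      def chunk_ignoring_row_groups(total_rows, batch_size):
          # one batch per batch_size rows; the last batch gets the remainder
          return [min(batch_size, total_rows - start)
                  for start in range(0, total_rows, batch_size)]

      chunk_ignoring_row_groups(300, 80)
      # -> [80, 80, 80, 60]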
      

      However, when the file contains a dictionary-typed column with string values (integer dictionary values don't trigger it), the batches follow the row group boundaries:

      import pandas as pd

      table = pa.table({'a': pd.Categorical([str(x) for x in range(300)])})
      pq.write_table(table, "test.parquet", row_group_size=100)
      f = pq.ParquetFile("test.parquet")
      
      In [13]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
      Out[13]: [80, 20, 60, 40, 40, 60]
      

      Note that the batch_size count does not restart at the beginning of each row group: the reader only splits a batch that is in progress at a boundary and then keeps counting (e.g. the 20-row batch and the following 60-row batch together form one logical batch of 80 rows).
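
      Concretely, the observed output matches the following model, in which the batch_size counter keeps running across row groups but an in-progress batch is cut at every row group boundary (again just an illustration of the behaviour, not the actual implementation):

      def split_at_boundaries(row_group_sizes, batch_size):
          batches = []
          remaining = batch_size  # rows still missing from the current logical batch
          for rows in row_group_sizes:
              while rows > 0:
                  n = min(remaining, rows)  # cut short at the row group boundary
                  batches.append(n)
                  rows -= n
                  remaining -= n
                  if remaining == 0:
                      remaining = batch_size  # logical batch complete; start a new one
          return batches

      split_at_boundaries([100, 100, 100], 80)
      # -> [80, 20, 60, 40, 40, 60]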

      Additionally, when reading empty batches (an empty column selection), the row group boundaries are also followed, but differently: the batching is done independently within each row group:

      In [14]: [batch.num_rows for batch in f.iter_batches(batch_size=80, columns=[])]
      Out[14]: [80, 20, 80, 20, 80, 20]
      

      (this is explicitly coded here: https://github.com/apache/arrow/blob/e05f032c1e5d590ac56372d13ec637bd28b47a96/cpp/src/parquet/arrow/reader.cc#L899-L921)
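
      That third pattern corresponds to restarting the batch_size count in every row group, as in this sketch (for comparison with the model above; the helper name is again made up):

      def split_per_row_group(row_group_sizes, batch_size):
          # batching restarts independently in each row group
          batches = []
          for rows in row_group_sizes:
              while rows > 0:
                  n = min(batch_size, rows)
                  batches.append(n)
                  rows -= n
          return batches

      split_per_row_group([100, 100, 100], 80)
      # -> [80, 20, 80, 20, 80, 20]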

      I don't know what the expected behaviour should be, but I would at least expect it to be consistent.

          People

            Assignee: Unassigned
            Reporter: Joris Van den Bossche