Apache Arrow / ARROW-6076

[C++][Parquet] RecordReader::Reset logic is inefficient for small reads


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None

    Description

      We have a unit test

      https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L933

      that reads one record at a time from a Parquet-Arrow column reader. RecordReader::Reset has logic that advances the definition/repetition levels past the data consumed by previous records, and it does so by copying the remaining levels on every call. That is inefficient for this case:

      https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1011

      This should be refactored so that the copy is not required, or at least so that the levels are only "shifted" occasionally rather than on every read.
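
A minimal sketch of the "shift only occasionally" option, to make the proposal concrete. This is not the parquet::internal::RecordReader code: `LevelBuffer`, `Append`, `Consume`, `Peek` and `shift_count` are hypothetical names invented for illustration, and the "compact once the consumed prefix is at least as large as the unread tail" threshold is an assumption, not something decided in this issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the def/rep level buffer of a record reader.
class LevelBuffer {
 public:
  // Append newly decoded levels at the end of the buffer.
  void Append(const std::vector<int16_t>& levels) {
    data_.insert(data_.end(), levels.begin(), levels.end());
  }

  // Mark `n` levels as consumed. Only the read position moves; no copy.
  void Consume(std::size_t n) {
    assert(n <= remaining());
    position_ += n;
  }

  // Called after every read. Shifts the unread levels to the front only
  // when the consumed prefix is at least as large as the unread tail, so
  // every copied element is paid for by at least one consumed element.
  void Reset() {
    if (position_ == 0 || position_ < remaining()) return;
    const std::size_t left = remaining();
    // Destination starts before the source range, so std::copy is safe here.
    std::copy(data_.begin() + static_cast<std::ptrdiff_t>(position_),
              data_.end(), data_.begin());
    data_.resize(left);
    position_ = 0;
    ++shift_count_;
  }

  // i-th unread level.
  int16_t Peek(std::size_t i) const { return data_[position_ + i]; }

  std::size_t remaining() const { return data_.size() - position_; }
  std::size_t position() const { return position_; }
  std::size_t shift_count() const { return shift_count_; }

 private:
  std::vector<int16_t> data_;
  std::size_t position_ = 0;     // index of the first unread level
  std::size_t shift_count_ = 0;  // number of compactions performed
};
```

With this threshold, reading one record at a time from a buffer of n levels triggers a number of shifts that grows roughly with log2(n) instead of one shift per read, at the cost of keeping up to half of the buffer as already-consumed slack.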


          People

            Assignee: Unassigned
            Reporter: Wes McKinney (wesm)
