  1. Apache Arrow
  2. ARROW-6895

[C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: 0.15.0
    • Fix Version: 0.17.0
    • Component: C++
    • Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)

    Description

      Given most columns, I can run a loop like:

      std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
      while (nRowsRemaining > 0) {
          // Read at most 100 rows per call to keep the memory footprint small.
          int n = std::min(100, nRowsRemaining);
          std::shared_ptr<arrow::ChunkedArray> chunkedArray;
          auto status = columnReader->NextBatch(n, &chunkedArray);
          if (!status.ok()) break;  // stop on a read error
          // ... and then use `chunkedArray`
          nRowsRemaining -= n;
      }
      

      (The context is: "convert Parquet to CSV/JSON, with small memory footprint." Used in https://github.com/CJWorkbench/parquet-to-arrow)

      Normally, the first NextBatch() return value looks like val0...val99; the second return value looks like val100...val199; and so on.

      ... but with a ByteArrayDictionaryRecordReader, that isn't the case. The first NextBatch() return value looks like val0...val99; the second return value looks like val0...val99, val100...val199 (a ChunkedArray with two arrays); the third return value looks like val0...val99, val100...val199, val200...val299 (a ChunkedArray with three arrays); and so on. The arrays returned by earlier calls are never cleared.

      In sum: NextBatch() on a dictionary column reader returns the wrong values.
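
The symptom above can be modeled without Arrow at all. The following is a minimal sketch, not Arrow's implementation: `ToyDictionaryReader`, `Chunk`, `ChunkedResult`, `TotalLength` and `MakeColumn` are hypothetical names invented for illustration. It assumes the cause is a list of result chunks that is appended to on every call and never cleared, which is what the observed output suggests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not Arrow types: a Chunk is one array of values,
// and a ChunkedResult is what a single NextBatch() call hands back.
using Chunk = std::vector<std::string>;
using ChunkedResult = std::vector<Chunk>;

// Toy reader over an in-memory column. `reset_on_read` selects between the
// behavior described in this report (false) and the expected behavior (true).
class ToyDictionaryReader {
 public:
  ToyDictionaryReader(std::vector<std::string> column, bool reset_on_read)
      : column_(std::move(column)), reset_on_read_(reset_on_read) {}

  ChunkedResult NextBatch(std::size_t n) {
    assert(pos_ <= column_.size());
    if (reset_on_read_) {
      accumulated_.clear();  // drop chunks already returned by earlier calls
    }
    const std::size_t end = std::min(pos_ + n, column_.size());
    const auto first = column_.begin() + static_cast<std::ptrdiff_t>(pos_);
    const auto last = column_.begin() + static_cast<std::ptrdiff_t>(end);
    accumulated_.emplace_back(first, last);  // append the newly read chunk
    pos_ = end;
    return accumulated_;  // without the reset, older chunks are still in here
  }

 private:
  std::vector<std::string> column_;
  bool reset_on_read_;
  std::size_t pos_ = 0;
  ChunkedResult accumulated_;
};

// Total number of values across all chunks of one result.
std::size_t TotalLength(const ChunkedResult& result) {
  std::size_t total = 0;
  for (const Chunk& chunk : result) total += chunk.size();
  return total;
}

// Builds the column "val0", "val1", ..., "val<n-1>".
std::vector<std::string> MakeColumn(std::size_t n) {
  std::vector<std::string> column;
  column.reserve(n);
  for (std::size_t i = 0; i < n; ++i) column.push_back("val" + std::to_string(i));
  return column;
}
```

With `reset_on_read` set to false, the second `NextBatch(100)` on a 300-value column returns two chunks holding 200 values, starting again at val0. With it set to true, the same call returns a single chunk holding val100...val199. Whether clearing the accumulated chunks is a complete fix in the real reader is a separate question that this toy model does not answer.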

      I've attached a minimal Parquet file that presents this problem with the above code; and I've written a patch that fixes this one case, to illustrate where things are wrong. I don't think I understand enough edge cases to decree that my patch is a correct fix.

      Attachments

        1. reset-dictionary-on-read.diff (16 kB, Adam Hooper)
        2. bad.parquet (0.5 kB, Adam Hooper)
        3. works.parquet (0.4 kB, Adam Hooper)
        4. 01-fix-arrow-6895.diff (0.9 kB, Adam Hooper)

          People

            Assignee: Adam Hooper (adamhooper)
            Reporter: Adam Hooper (adamhooper)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 40m
