[ARROW-6895] [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()` - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.15.0
Fix Version/s: 0.17.0
Component/s: C++
Labels:
- pull-request-available
Environment:
Linux 5.2.17-200.fc30.x86_64 (Docker)

External issue URL:
https://github.com/apache/arrow/issues/23222

Description

For most columns, I can run a loop like this:

std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
while (nRowsRemaining > 0) {
    int n = std::min(100, nRowsRemaining);
    std::shared_ptr<arrow::ChunkedArray> chunkedArray;
    auto status = columnReader->NextBatch(n, &chunkedArray);
    // ... and then use `chunkedArray`
    nRowsRemaining -= n;
}

(The context is: "convert Parquet to CSV/JSON, with small memory footprint." Used in https://github.com/CJWorkbench/parquet-to-arrow)

Normally, the first NextBatch() return value looks like val0...val99; the second return value looks like val100...val199; and so on.
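The expected behavior can be written down as a list of row ranges, one per NextBatch() call. The sketch below is only an illustration: `ExpectedBatchRanges` is a hypothetical helper, not part of Arrow or Parquet, and it assumes every call is asked for the same batch size until the rows run out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open row ranges [first, second) that successive NextBatch(batch_size)
// calls are expected to cover. Hypothetical helper, not an Arrow API.
std::vector<std::pair<int64_t, int64_t>> ExpectedBatchRanges(int64_t total_rows,
                                                             int64_t batch_size) {
  assert(total_rows >= 0 && batch_size > 0);
  std::vector<std::pair<int64_t, int64_t>> ranges;
  for (int64_t start = 0; start < total_rows; start += batch_size) {
    // The last range is shorter when total_rows is not a multiple of batch_size.
    ranges.emplace_back(start, std::min(start + batch_size, total_rows));
  }
  return ranges;
}
```

For 250 rows read 100 at a time, this yields [0, 100), [100, 200) and [200, 250), matching the val0...val99, val100...val199 pattern described above.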

... but with a ByteArrayDictionaryRecordReader, that isn't the case. The first NextBatch() return value looks like val0...val99; the second return value looks like val0...val99, val100...val199 (a ChunkedArray with two arrays); the third return value looks like val0...val99, val100...val199, val200...val299 (a ChunkedArray with three arrays); and so on. The returned arrays are never cleared between calls.

In sum: NextBatch() on a dictionary column reader returns the wrong values.
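To make the symptom concrete, here is a toy model of a reader that accumulates chunks and hands back everything it holds. It is a sketch under stated assumptions, not Arrow's code: `ToyChunkReader`, its `clear_after_read` flag, and `TotalRows` are all hypothetical names, and plain `int` row indices stand in for val0, val1, and so on. With clearing disabled it reproduces the growing results described above; with clearing enabled each call returns only the new rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a record reader. NOT Arrow's ByteArrayDictionaryRecordReader.
class ToyChunkReader {
 public:
  ToyChunkReader(int total_rows, bool clear_after_read)
      : total_rows_(total_rows), clear_after_read_(clear_after_read) {}

  // Reads up to n rows into a new chunk, then returns every chunk the
  // reader currently holds.
  std::vector<std::vector<int>> NextBatch(int n) {
    assert(n > 0);
    std::vector<int> chunk;
    for (int i = 0; i < n && next_row_ < total_rows_; ++i) {
      chunk.push_back(next_row_++);
    }
    chunks_.push_back(chunk);
    std::vector<std::vector<int>> result = chunks_;
    if (clear_after_read_) {
      chunks_.clear();  // The step the report says never happens.
    }
    return result;
  }

 private:
  int total_rows_;
  bool clear_after_read_;
  int next_row_ = 0;
  std::vector<std::vector<int>> chunks_;
};

// Total number of rows across all chunks of one NextBatch() result.
std::size_t TotalRows(const std::vector<std::vector<int>>& chunks) {
  std::size_t rows = 0;
  for (const auto& chunk : chunks) rows += chunk.size();
  return rows;
}
```

A caller-side sanity check follows from the model: after asking for n rows, the total length of the result should be n (or fewer at the end of the column), never more.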

I've attached a minimal Parquet file that reproduces this problem with the code above, and I've written a patch that fixes this one case, to illustrate where things go wrong. I don't understand the edge cases well enough to claim that my patch is a correct fix.

Attachments

- 01-fix-arrow-6895.diff (18/Feb/20 15:51, 0.9 kB, Adam Hooper)
- bad.parquet (15/Oct/19 19:15, 0.5 kB, Adam Hooper)
- reset-dictionary-on-read.diff (15/Oct/19 19:15, 16 kB, Adam Hooper)
- works.parquet (15/Oct/19 19:15, 0.4 kB, Adam Hooper)

Issue Links

is duplicated by

ARROW-7545 [C++] [Dataset] Scanning dataset with dictionary type hangs

Closed

links to

GitHub Pull Request #6206

GitHub Pull Request #6460

People

Assignee: Adam Hooper

Reporter: Adam Hooper

Votes: 0

Watchers: 7

Dates

Created: 15/Oct/19 19:20

Updated: 11/Jan/23 07:50

Resolved: 27/Mar/20 20:40

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 40m