Apache Arrow / ARROW-9297

[C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)


Details

    Description

      Related to ARROW-3762 (the Parquet issue, which has been solved), and discovered in ARROW-9139.

      When creating a Parquet file with a large binary column (larger than the single-chunk BinaryArray capacity of 2 GB):

      # code from the test_parquet.py::test_binary_array_overflow_to_chunked test
      import pyarrow as pa
      import pyarrow.parquet as pq

      values = [b'x'] + [
          b'x' * (1 << 20)
      ] * 2 * (1 << 10)

      table = pa.table({'byte_col': values})
      pq.write_table(table, "test_large_binary.parquet")

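      For context, a quick check (assuming BinaryArray offsets are 32-bit signed integers, capping a single chunk at 2**31 - 1 bytes) shows the test data above lands just past that limit:

```python
# Sketch, not part of the issue: check that the test data above exceeds
# the 32-bit-offset capacity of a single BinaryArray chunk.
chunk_capacity = 2**31 - 1                  # int32 offset limit, ~2 GiB

# One 1-byte value plus 2 * 1024 values of 1 MiB each.
total_bytes = 1 + (1 << 20) * 2 * (1 << 10)

print(total_bytes)                          # 2147483649
print(total_bytes > chunk_capacity)         # True: must be split into chunks
```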
      then reading this with the parquet API works (fixed by ARROW-3762):

      In [3]: pq.read_table("test_large_binary.parquet")
      Out[3]:
      pyarrow.Table
      byte_col: binary
      

      but with the Datasets API this still fails:

      In [1]: import pyarrow.dataset as ds

      In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")

      In [4]: dataset.to_table()
      ---------------------------------------------------------------------------
      ArrowNotImplementedError                  Traceback (most recent call last)
      <ipython-input-4-6fb0d79c4511> in <module>
      ----> 1 dataset.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      
      ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      ArrowNotImplementedError: This class cannot yet iterate chunked arrays
      
      


          People

            Assignee: Ben Kietzman (bkietz)
            Reporter: Joris Van den Bossche (jorisvandenbossche)
            Votes: 0
            Watchers: 3


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 7h 10m
