Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.0.0
- Fix Version/s: None
Description
Take() currently concatenates ChunkedArrays first. However, this breaks down when calling Take() on a ChunkedArray or Table where concatenating the chunks would produce an array that is too large (e.g., a string array whose data exceeds the 2 GiB limit of 32-bit offsets). While inconvenient to implement, it would be useful if this case were handled, perhaps as a higher-level wrapper around Take().
Example in Python:

>>> import pyarrow as pa
>>> pa.__version__
'1.0.0'
>>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
>>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
>>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
>>> table.take([1, 0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
  File "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", line 268, in take
    return call_function('take', [data, indices], options)
  File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
In this example, it would be useful if Take() or a higher-level wrapper could generate multiple record batches as output.
Issue Links
- is a child of: ARROW-12633 [C++] Query engine umbrella issue (Open)
- is duplicated by: ARROW-10799 [C++] Take on string chunked arrays slow and fails (Closed)
- is duplicated by: ARROW-15808 [Python] take function doesn't work when table has large row counts (Closed)
- is related to: ARROW-10799 [C++] Take on string chunked arrays slow and fails (Closed)
- links to