Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.0.0
- Fix Version/s: None
Description
Take() currently concatenates ChunkedArrays first. However, this breaks down when calling Take() on a ChunkedArray or Table where concatenating the chunks would produce an array that is too large (e.g., a string array whose data exceeds the 2 GiB limit of 32-bit offsets). While inconvenient to implement, it would be useful if this case were handled, perhaps as a higher-level wrapper around Take().
Example in Python:

>>> import pyarrow as pa
>>> pa.__version__
'1.0.0'
>>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
>>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
>>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
>>> table.take([1, 0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
  File "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", line 268, in take
    return call_function('take', [data, indices], options)
  File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
In this example, it would be useful if Take() or a higher-level wrapper could generate multiple record batches as output.
Issue Links
- is a child of: ARROW-12633 [C++] Query engine umbrella issue (Open)
- is duplicated by: ARROW-10799 [C++] Take on string chunked arrays slow and fails (Closed)
- is duplicated by: ARROW-15808 [Python] take function doesn't work when table has large row counts (Closed)
- is related to: ARROW-10799 [C++] Take on string chunked arrays slow and fails (Closed)
- links to