  1. Apache Arrow
  2. ARROW-9773

[C++] Take kernel can't handle ChunkedArrays that don't fit in an Array


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: C++

    Description

      Take() currently handles a ChunkedArray by concatenating its chunks into a single Array first. However, this breaks down when Take() is called on a ChunkedArray or Table whose chunks, once concatenated, would form an Array that is too large to represent. While inconvenient to implement, it would be useful if this case were handled.

      This could be done as a higher-level wrapper around Take(), perhaps.

      Example in Python:

      >>> import pyarrow as pa
      >>> pa.__version__
      '1.0.0'
      >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
      >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
      >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
      >>> table.take([1, 0])
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
        File "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", line 268, in take
          return call_function('take', [data, indices], options)
        File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
        File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
        File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
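
      The overflow in this traceback can be checked with plain arithmetic. The sketch below assumes the column is inferred as Arrow's `string` type, which stores offsets as 32-bit signed integers, and that each ASCII character takes one byte; the variable names are illustrative only.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit string array can store

# Byte sizes of the two values: "a" * 2**30 and "b" * 2**30 (one byte per character).
chunk_bytes = [2**30, 2**30]
total_bytes = sum(chunk_bytes)

# Each chunk fits on its own, but the concatenated data does not.
each_chunk_fits = all(n <= INT32_MAX for n in chunk_bytes)
concatenation_fits = total_bytes <= INT32_MAX
print(each_chunk_fits, concatenation_fits)  # True False
```

      The concatenated data is exactly one byte past the largest representable offset, so each record batch is valid while their concatenation is not.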
      

      In this example, it would be useful if Take() or a higher-level wrapper could generate multiple record batches as output.
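
      One possible shape for such a wrapper is sketched below in pure Python. It is not how Arrow implements Take(). Plain lists stand in for Arrow arrays, and `resolve_index`, `take_chunked`, and `max_rows_per_chunk` are hypothetical names. The idea is to map each logical index to a (chunk, offset) pair and emit the result in several pieces, so the inputs are never concatenated.

```python
from bisect import bisect_right
from itertools import accumulate


def resolve_index(chunk_ends, index):
    """Map a logical row index to (chunk number, offset within that chunk)."""
    total = chunk_ends[-1] if chunk_ends else 0
    if index < 0 or index >= total:
        raise IndexError(f"index {index} out of bounds for {total} rows")
    # chunk_ends holds cumulative lengths, so the first end greater than
    # `index` identifies the chunk that contains it (empty chunks are skipped).
    chunk_no = bisect_right(chunk_ends, index)
    chunk_start = chunk_ends[chunk_no - 1] if chunk_no > 0 else 0
    return chunk_no, index - chunk_start


def take_chunked(chunks, indices, max_rows_per_chunk):
    """Gather rows by logical index without concatenating the input chunks.

    The result is a list of output chunks, each holding at most
    `max_rows_per_chunk` rows.
    """
    if max_rows_per_chunk < 1:
        raise ValueError("max_rows_per_chunk must be positive")
    chunk_ends = list(accumulate(len(c) for c in chunks))
    out, current = [], []
    for index in indices:
        chunk_no, offset = resolve_index(chunk_ends, index)
        current.append(chunks[chunk_no][offset])
        if len(current) == max_rows_per_chunk:
            out.append(current)  # flush a full output chunk
            current = []
    if current:
        out.append(current)  # flush the final, partially filled chunk
    return out


# Mirrors table.take([1, 0]) from the example, with short stand-in values.
print(take_chunked([["a"], ["b"]], [1, 0], max_rows_per_chunk=1))  # [['b'], ['a']]
```

      A real wrapper would more likely cap output chunks by data size in bytes than by row count, since the limit hit above is on string offsets; the row cap here only keeps the sketch short.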

            People

              wjones127 Will Jones


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4.5h