Apache Arrow / ARROW-13509

[C++] Take compute function should pass through ChunkedArray type to handle empty input arrays


Details

    Description

      I'm trying to explode a table (in the pandas sense: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html)

      As it's not yet supported, I've written some code to do it using a mix of list_flatten and list_parent_indices. It works well, except that it crashes on empty tables:

      WARNING: Logging before InitGoogleLogging() is written to STDERR
      F0730 15:16:05.164858 13612 chunked_array.cc:48]  Check failed: (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
      *** Check failure stack trace: ***
      Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
      
      

      Here's a reproducible example:

      
      import sys
      
      import pyarrow as pa
      from pyarrow import compute
      import pandas as pd
      
      table = pa.Table.from_arrays(
          [
              pa.array([101, 102, 103], pa.int32()),
              pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
          ],
          names=['key', 'list']
      )
      
      
      def explode(table: pa.Table) -> pa.Table:
          exploded_list = compute.list_flatten(table['list'])
      
          indices = compute.list_parent_indices(table['list'])
          assert indices.type == pa.int32()
          keys = compute.take(table['key'], indices)  # <--- Crashes here
          return pa.Table.from_arrays(
              [keys, exploded_list],
              names=['key', 'list_element']
          )
      
      
      explode(table).to_pandas().to_markdown(sys.stdout)
      explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout) # <--- doesn't work
      

       
      I've narrowed it down to the following:

      When list_parent_indices is called on an empty table, it returns this chunked array with no chunks:

      pa.chunked_array([], pa.int32())
      

      Instead of this chunked array with 1 empty chunk:

      pa.chunked_array([pa.array([], pa.int32())])
      

      In turn, take doesn't work with the chunked array that has no chunks:

      compute.take(pa.chunked_array([pa.array([], pa.int32())]),
                   pa.chunked_array([], pa.int32())) # Bad
      compute.take(pa.chunked_array([pa.array([], pa.int32())]),
                   pa.chunked_array([pa.array([], pa.int32())])) # Good
      

      As for how to fix it, there are two options:

      • take could accept empty chunked array
      • list_parent_indices could return a chunked array with an empty chunk

      PS: the error message isn't accurate. It says "cannot construct ChunkedArray from empty vector and omitted type", but the array being passed does have a type (int32); it just has no chunks. This makes me suspect that something in take strips the type from the empty chunked array.

       

            People

              aucahuasi Percy Camilo Triveño Aucahuasi
              0x26dres &res

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m