  1. Apache Arrow
  2. ARROW-15643

[C++] Kernel to select subset of fields of a StructArray


Details

    Description

      Triggered by https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure. I thought there was already an issue about this, but I can't find one right away.

      Assume you have a struct array with some fields:

      >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
      >>> arr.type
      StructType(struct<a: int64, b: int64, c: int64>)
      

      We have a kernel to select a single child field:

      >>> pc.struct_field(arr, [0])
      <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
      [
        1,
        2,
        3
      ]
      

      But if you want to subset the StructArray to some of its fields, resulting in a new StructArray, that's not possible with struct_field, and doing this manually is a bit cumbersome:

      >>> fields = ['a', 'c']
      >>> arrays = [arr.field(n) for n in fields]
      >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
      >>> arr_subset.type
      StructType(struct<a: int64, c: int64>)
      

      (This is still OK for a single array, but with a ChunkedArray you have to repeat it for every chunk, which certainly gets annoying.)

      One option could be to expand the existing struct_field to allow selecting multiple fields, although that would probably be ambiguous or confusing given how you currently select a recursively nested field: [0, 1] currently means "first child, second subchild" and not "first and second child".
      Another option is a new kernel such as "struct_subset" (or some other name).

      This might also overlap with general projection functionality? (cc westonpace)

            People

              Assignee: Dhruv Vats (dhruv9vats)
              Reporter: Joris Van den Bossche (jorisvandenbossche)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 4h 10m