  1. Apache Arrow
  2. ARROW-5869

[Python] Need a way to access UnionArray's children as Arrays in pyarrow


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
      None

      Description

       

      There doesn't seem to be a way to get to the children of a sparse or dense UnionArray. Other types with children each have an accessor:

      • ListType: array.flatten()
      • StructType: array.field("fieldname")
      • DictionaryType: array.indices and now array.dictionary (in 0.14.0)
      • (other types have no children, I think...)

      The reason this comes up now is that I have a downstream library that makes a zero-copy view of Arrow data by recursively walking over its types and interpreting the list of buffers for each type. In the past, I didn't need the child arrays of each array: I popped the right number of buffers off the list depending on the type. But as of 0.14.0, the dictionary for DictionaryType has moved from the type object to the array object. Since it's now neither in the buffers list nor in the type tree, I need to walk the tree of arrays in tandem with the tree of types.
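      The buffer-popping approach described here can be sketched with a toy model. This is not the downstream library's code: the (kind, children) tuples, the BUFFERS_PER_KIND table, and assign_buffers are hypothetical names, and the buffer counts are an assumption based on the Arrow columnar layout (a validity bitmap slot for every node, plus offsets or data where applicable).

```python
# Hypothetical buffer counts per node kind; the validity slot is counted
# even when no bitmap is present.
BUFFERS_PER_KIND = {
    "primitive": 2,  # validity, data
    "list": 2,       # validity, offsets
    "struct": 1,     # validity
}

def assign_buffers(tpe, buffers):
    """Consume buffers depth-first; `tpe` is a (kind, [child types]) tuple."""
    kind, children = tpe
    own = [buffers.pop(0) for _ in range(BUFFERS_PER_KIND[kind])]
    return {
        "kind": kind,
        "buffers": own,
        "children": [assign_buffers(child, buffers) for child in children],
    }

# A list of primitives: the flat buffer list holds the parent's buffers first.
list_of_ints = ("list", [("primitive", [])])
flat = ["validity0", "offsets", "validity1", "data"]
tree = assign_buffers(list_of_ints, list(flat))
```

      In this model the type tree alone determines how many buffers to take at each step, which is why anything stored on the array rather than in the buffers or the type is out of reach.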

      That would be okay, except that I don't see how to descend from a UnionArray to its children.

      This is the function where I walk down the types (tpe), and now also the arrays (array), while interpreting the right number of buffers at each step:

      https://github.com/scikit-hep/awkward-array/blob/7c5961405cc39bbf2b489fad171652019c8de41b/awkward/arrow.py#L228-L364

      Simply exposing the std::vector named "children" as a Python sequence, or adding a child(int i) method, would provide a way to descend into UnionArrays and make this kind of access uniform across all types.

      Alternatively, putting array.dictionary in the list of buffers would also solve my problem (and make it unnecessary for me to walk over the arrays), but in general it seems like a good idea to make child arrays accessible. The dictionary does seem to belong in the buffers, but that would probably be a big change, not to be undertaken for minor reasons.

      Thanks!

            People

            • Assignee: Unassigned
            • Reporter: Jim Pivarski (jpivarski)
            • Votes: 0
            • Watchers: 3
