ARROW-17813: [Python] Nested ExtensionArray conversion to/from pandas/numpy

Details

    Description

      user@ thread: https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb
      repro gist: https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9

      Arrow => numpy/pandas

      For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to the storage type (as expected). However, the same lowering is not applied when the extension array is nested inside another array, such as a list:

      import pyarrow as pa
      
      class LabelType(pa.ExtensionType):
      
          def __init__(self):
              super(LabelType, self).__init__(pa.string(), "label")
      
          def __arrow_ext_serialize__(self):
              return b""
      
          @classmethod
          def __arrow_ext_deserialize__(cls, storage_type, serialized):
              return LabelType()
          
      storage = pa.array(["dog", "cat", "horse"])
      ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
      offsets = pa.array([0, 1])
      list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
      list_arr.to_numpy()
      
      ---------------------------------------------------------------------------
      ArrowNotImplementedError                  Traceback (most recent call last)
      Cell In [15], line 1
      ----> 1 list_arr.to_numpy()
      
      File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, in pyarrow.lib.Array.to_numpy()
      
      File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
      
      ArrowNotImplementedError: Not implemented type for Arrow list to pandas: extension<label<LabelType>>
      

      As mentioned on the user thread linked from the top, a fairly generic solution would just have the conversion default to the storage array's to_numpy.

       
      pandas/numpy => Arrow

      Similarly, conversion to Arrow is also difficult for nested extension types:

      Say I have a pandas DataFrame with a list-of-string column that I want to convert to a list-of-label Array. Currently I have to:
      1. Convert the list-of-string (storage) column to an Arrow array of type pa.list_(pa.string())
      2. Convert its string values array to an ExtensionArray, then reconstitute a list<extension> array from that ExtensionArray combined with the offsets from the result of step 1

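      import pyarrow as pa
      import pandas as pd

      # LabelType is the extension type defined in the repro above.
      df = pd.DataFrame({'labels': [["dog", "horse", "cat"], ["person", "person", "car", "car"]]})
      # Step 1: convert the column to its storage representation, list<string>.
      list_of_storage = pa.array(df.labels)
      # Step 2: wrap the flat string values in the extension type, then rebuild
      # the list array around them using the offsets from step 1.
      ext_values = pa.ExtensionArray.from_storage(LabelType(), list_of_storage.values)
      list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets, values=ext_values)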
      

      For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but I don't think that works for a nested column. You would instead have to fall back to something like `pa.ExtensionArray.from_storage` (or `from_pandas`?) to do the trick. Even that doesn't necessarily work for something like a dictionary column, because you'd have to pass in the dictionary somehow. Off the cuff, `pa.Table.from_pandas` could accept a custom lambda that is applied to specified column names or data types?

      Thanks in advance for the consideration!

            People

              Assignee: Miles Granger (milesgranger)
              Reporter: Chang She (changhiskhan)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 3h 10m