Apache Arrow / ARROW-16838

[Python] Schema inference for pandas extension dtypes fails on indexes

Details

    Description

      Hi! Calling pa.Schema.from_pandas on a DataFrame whose index has a pandas extension dtype (e.g., string[python]) raises an AttributeError:

      import pandas as pd
      import pyarrow as pa
      df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
      pa.Schema.from_pandas(df)

      produces

      AttributeError                            Traceback (most recent call last)
      /tmp/ipykernel_1827952/3691394220.py in <module>
            1 import pyarrow as pa
            2 df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
      ----> 3 pa.Schema.from_pandas(df)
      
      ~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.from_pandas()
      
      ~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/pandas_compat.py in dataframe_to_types(df, preserve_index, columns)
          527             type_ = pa.array(c, from_pandas=True).type
          528         elif _pandas_api.is_extension_array_dtype(values):
      --> 529             type_ = pa.array(c.head(0), from_pandas=True).type
          530         else:
          531             values, type_ = get_datetimetz_type(values, c.dtype, None)
      
      AttributeError: 'Index' object has no attribute 'head'
      
      

      If I remove the `head` call in dataframe_to_types, or convert the index to a Series manually before that call, things work.

      Reported downstream in https://github.com/dask/dask/issues/9186

      Related issue from a couple of years ago: https://issues.apache.org/jira/browse/ARROW-8159
       

            People

              jrbourbeau James Bourbeau
              ian-r-rose Ian Rose

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 10m