  1. Apache Arrow
  2. ARROW-1897

[Python] Incorrect numpy_type for pandas metadata of Categoricals

      Description

      If I'm reading http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format correctly, the "numpy_type" field of a `Categorical` should be the storage type used for the codes. It looks like pyarrow always writes 'object' instead.

      In [1]: import pandas as pd
      
      In [2]: import pyarrow as pa
      
      In [3]: import pyarrow.parquet as pq
      
      In [4]: import io
      
      In [5]: import json
      
      In [6]: df = pd.DataFrame({"A": [1, 2]},
         ...:                   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
         ...:
      In [8]: sink = io.BytesIO()
         ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
         ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
         ...:
      Out[8]:
      {'field_name': '__index_level_0__',
       'metadata': {'num_categories': 2, 'ordered': False},
       'name': 'idx',
       'numpy_type': 'object',
       'pandas_type': 'categorical'}
      

      From the spec:

      The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. So for datetimetz this is datetime64[ns] and for categorical, it may be any of the supported integer categorical types.

      So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
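
      As a rough illustration of the expected value, the sketch below applies the spec's `str(dtype)` rule to the codes array of the same index used in the report. It only uses pandas; the variable name `expected_numpy_type` is made up for this example and is not part of pyarrow or the spec.

```python
import pandas as pd

# Same categorical index as in the report above.
idx = pd.CategoricalIndex(['one', 'two'], name='idx')

# A categorical is physically stored as an integer array of codes
# that point into the list of categories.
codes = idx.codes

# Per the spec, numpy_type is str(dtype) of the underlying storage array.
# `expected_numpy_type` is a hypothetical name used only in this sketch.
expected_numpy_type = str(codes.dtype)

print(expected_numpy_type)  # int8 for a small number of categories
```

      With only two categories pandas picks the smallest signed integer type for the codes, so this is the value one would expect to see in the metadata; a much larger number of categories would yield a wider integer type.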

            People

            • Assignee: Phillip Cloud (cpcloud)
            • Reporter: Tom Augspurger (TomAugspurger)