  1. Apache Arrow
  2. ARROW-17625

Cast error on roundtrip of categorical column to parquet and back

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python

    Description

      Writing a table to Parquet and then reading it back fails if:

      1. One of the columns is a dictionary (it came from a pandas Categorical), and
      2. The table's schema is passed to `read_table`.

      The read fails while attempting to cast int64 to dictionary (full stack trace below).

      This seems related to ARROW-11157. Even if the categorical type is lost when reading from Parquet, the reader should not raise an error when given the original schema.

      Minimal example of failing code:

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds

      a = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
      b = ["a" for i in a]
      c = [i for i in range(len(a))]
      df = pd.DataFrame({"a": a, "b": b, "c": c})
      # Make column "a" a pandas Categorical (an Arrow dictionary type)
      df['a'] = df['a'].astype('category')
      print("df dtypes:\n", df.dtypes)

      t = pa.Table.from_pandas(df, preserve_index=True)
      s = t.schema  # includes the dictionary type for column "a"
      ds.write_dataset(t, format='parquet', base_dir='./test')
      # Reading back with the original schema raises ArrowNotImplementedError
      df2 = pq.read_table('./test', schema=s).to_pandas()
      print("df2 dtypes:\n", df2.dtypes)
      

       

      Which gives: 

      df dtypes:
       a    category
      b      object
      c       int64
      dtype: object
      Traceback (most recent call last):
        File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
          df2 = pq.read_table('./test', schema=s).to_pandas()
        File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
          return dataset.read(columns=columns, use_threads=use_threads,
        File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
          table = self._dataset.to_table(
        File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
        File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
        File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
      pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary
      

          People

            Assignee: Unassigned
            Reporter: Yishai Beeri (yishaibeeri)