Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version: 5.0.0
- Environment: NAME="CentOS Linux", VERSION="7 (Core)"
Description
When converting a pandas DataFrame to a Table, categorical columns are by default given a dictionary index type of int8 in the schema (presumably because there are fewer than 128 categories). When this table is written to a Parquet file, the schema changes such that the index type becomes int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read back from disk.
A minimal recreation of the issue is as follows:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.fs  # needed so that pa.fs is available
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)

tbl = pa.Table.from_pandas(df)

where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()
pq.write_table(
    tbl,
    filesystem.open_output_stream(where, compression=None),
    version="2.0",
)

schema = tbl.schema
read_schema = pq.ParquetFile(filesystem.open_input_file(where)).schema_arrow
By printing schema and read_schema, you can see the inconsistency: the former shows a dictionary index type of int8 for column "B", while the latter shows int32.
I have workarounds in place for this, but am raising the issue anyway so that it can be resolved properly.