Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.11.1, 0.13.0
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.24.2
numpy: 1.16.4
pyarrow: 0.13.0
Description
A string categorical variable written from pandas to parquet is read back as string (object dtype). I expected it to be read back as category.
The same thing happens if the category is numeric – a numeric category is read back as int64.
In the code below, I tried an in-memory Arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file and read it back, the category dtype is lost.
In the scheme of things, this isn't a big deal, but it's a small surprise.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works:
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")

# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object

# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes  # category

pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")

# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64
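For affected versions, one possible workaround is to cast the columns back after reading. This is only a sketch, assuming the caller already knows which columns were categorical. `restore_categories` is a hypothetical helper, not part of pandas or pyarrow, and the frame below stands in for the result of `pd.read_parquet` rather than reading a real file.

```python
import pandas as pd


def restore_categories(df, columns):
    """Return a copy of df with the named columns cast to category dtype.

    Hypothetical helper for illustration; not a pandas or pyarrow API.
    """
    out = df.copy()
    for col in columns:
        # Categories are inferred from the values present in the column.
        out[col] = out[col].astype("category")
    return out


# Stand-in for what read_parquet returned in the report above:
# plain values where a categorical column was written.
read_back = pd.DataFrame({"x": ["a", "a", "b", "b"], "n": [1, 1, 2, 2]})

restored = restore_categories(read_back, ["x", "n"])
print(restored.dtypes)
```

Note that this does not recover everything: categories that had no rows in the data, a custom category order, and the ordered flag are all lost, because the cast only sees the values that were read back.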
Issue Links
- is related to: ARROW-3652 [Python] CategoricalIndex is lost after reading back (Resolved)
- relates to: ARROW-3246 [Python][Parquet] direct reading/writing of pandas categoricals in parquet (Resolved)
- links to