Details
Bug
Status: Open
Major
Resolution: Unresolved
7.0.0
None
pandas 1.3.5
pyarrow 7.0.0
python 3.10.4
Description
This issue is present in pyarrow v7.0.0, but not in v6.0.1.
pyarrow raises an error when reading a Parquet file with an empty "in" filter on a string or categorical column (columns "F" and "E" in the example below). The issue does not occur in v7.0.0 when the same empty filter is applied to a float, timestamp or integer column ("A" through "D").
The following Python code presents a minimal example which reproduces the issue:
import pandas as pd
import numpy as np

path = './example_df.parquet'

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
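Until this is fixed, one possible caller-side guard is to detect that the filter cannot match any row before handing it to pyarrow. The sketch below is only an illustration, not part of pandas or pyarrow: the helper name filter_matches_nothing is hypothetical, and it assumes the filters use the same list-of-tuples or list-of-lists (DNF) form as the example above.

```python
def filter_matches_nothing(filters):
    """Return True if a DNF filter list can never match a row.

    Hypothetical helper (not a pandas/pyarrow API). `filters` is either a
    flat list of (column, op, value) tuples (one AND-group) or a list of
    such lists (OR of AND-groups), as in the reproduction above.
    """
    if not filters:
        # No filters at all means "keep everything", not "match nothing".
        return False

    # Normalize a flat list of tuples into a single AND-group.
    groups = [filters] if isinstance(filters[0], tuple) else filters

    # An AND-group is unsatisfiable if any predicate is `in` with no values.
    # The whole filter matches nothing only if every OR-branch is unsatisfiable.
    return all(
        any(op == "in" and len(values) == 0 for _col, op, values in group)
        for group in groups
    )


print(filter_matches_nothing([[("F", "in", set())]]))
print(filter_matches_nothing([[("F", "in", set())], [("A", "==", 1.0)]]))
```

When the helper returns True, the caller could skip the filtered read and construct an empty result some other way; how best to do that depends on the application and is not covered by this sketch.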