Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
If a filter is applied on a column where the physical schema and the dataset schema differ, scanning aborts: currently the dataset schema, if specified, is used for implicit casts, but the filter expression can then end up with a different type than the actual physical column.
A small Parquet file with one int32 column:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
df.to_parquet("test_filter_schema.parquet", engine="pyarrow")

dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
fragment = list(dataset.get_fragments())[0]
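(For reference, the fragment's physical_schema attribute exposes the schema of the file itself, so the mismatch with the int64 read schema used below can be inspected directly:)

# The file's own schema keeps the original int32 type
print(fragment.physical_schema)  # col: int32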
and then reading the fragment with a filter on that column, first without and then with specifying a dataset/read schema:
In [48]: fragment.to_table(filter=ds.field("col") > 2).to_pandas()
Out[48]:
   col
0    3
1    4

In [49]: fragment.to_table(filter=ds.field("col") > 2, schema=pa.schema([("col", pa.int64())])).to_pandas()
../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: int64 vs int32
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.100(+0xee2f86)[0x7f6b56490f86]
...
Aborted (core dumped)
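A possible workaround until this is fixed (a sketch, assuming the int32 -> int64 promotion is the only schema difference): read the fragment with its physical types and cast the result afterwards:

# Filter using the physical (int32) schema, then cast the resulting
# table to the desired read schema; Table.cast does the int32 -> int64
# promotion on the already-materialized data.
table = fragment.to_table(filter=ds.field("col") > 2)
table = table.cast(pa.schema([("col", pa.int64())]))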
Now, this int32 -> int64 type change is something we don't support yet in the schema evolution/normalization when scanning a dataset, but even so it should raise a normal error about the type mismatch rather than aborting.
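A sketch of the intended behavior after the fix, assuming the mismatch surfaces as a pyarrow.ArrowTypeError (the exact exception class is an assumption, not confirmed by this report):

try:
    fragment.to_table(filter=ds.field("col") > 2,
                      schema=pa.schema([("col", pa.int64())]))
except pa.ArrowTypeError as exc:
    # Expected: a catchable Python exception instead of a core dump
    print("filter/schema type mismatch:", exc)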