Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
If a filter is applied on a column where the physical schema and the dataset schema differ, scanning aborts: currently the dataset schema, if specified, is used for implicit casts, but the filter expression can then end up with a different type than the actual physical column.
A small Parquet file with one int32 column:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
df.to_parquet("test_filter_schema.parquet", engine="pyarrow")

dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
fragment = list(dataset.get_fragments())[0]
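(For reference, the fragment's physical_schema attribute exposes the schema of the file itself, so the mismatch with the int64 read schema used below can be inspected directly:)

# The file's own schema keeps the original int32 type
print(fragment.physical_schema)  # col: int32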
and then reading the fragment with a filter on that column, first without and then with specifying a dataset/read schema:
In [48]: fragment.to_table(filter=ds.field("col") > 2).to_pandas()
Out[48]:
   col
0    3
1    4

In [49]: fragment.to_table(filter=ds.field("col") > 2, schema=pa.schema([("col", pa.int64())])).to_pandas()
../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: int64 vs int32
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.100(+0xee2f86)[0x7f6b56490f86]
...
Aborted (core dumped)
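A possible workaround until this is fixed (a sketch, assuming the int32 -> int64 promotion is the only schema difference): read the fragment with its physical types and cast the result afterwards:

# Filter using the physical (int32) schema, then cast the resulting
# table to the desired read schema; Table.cast does the int32 -> int64
# promotion on the already-materialized data.
table = fragment.to_table(filter=ds.field("col") > 2)
table = table.cast(pa.schema([("col", pa.int64())]))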
Now, this int32 -> int64 type change is something we don't support yet in the schema evolution/normalization when scanning a dataset, but even so it should raise a normal error about the type mismatch rather than aborting.
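A sketch of the intended behavior after the fix, assuming the mismatch surfaces as a pyarrow.ArrowTypeError (the exact exception class is an assumption, not confirmed by this report):

try:
    fragment.to_table(filter=ds.field("col") > 2,
                      schema=pa.schema([("col", pa.int64())]))
except pa.ArrowTypeError as exc:
    # Expected: a catchable Python exception instead of a core dump
    print("filter/schema type mismatch:", exc)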