Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-9146

[C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort

    XMLWordPrintableJSON

Details

    Description

      If you have filter on a column where the physical and dataset schema differs, scanning aborts (as right now the dataset schema, if specified, gets used for implicit casts, but then the expression might have a different type as the actual physical column):

      Small parquet file with one int32 column:

      df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
      df.to_parquet("test_filter_schema.parquet", engine="pyarrow")
      
      import pyarrow.dataset as ds
      dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
      fragment = list(dataset.get_fragments())[0]
      

      and then reading in a fragment with a filter on that column, without and with specifying a dataset/read schema:

      In [48]: fragment.to_table(filter=ds.field("col") > 2).to_pandas() 
      Out[48]: 
         col
      0    3
      1    4
      
      In [49]: fragment.to_table(filter=ds.field("col") > 2, schema=pa.schema([("col", pa.int64())])).to_pandas()                                                                                                        
      ../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: int64 vs int32
      /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.100(+0xee2f86)[0x7f6b56490f86]
      ...
      Aborted (core dumped)
      

      Now this int32->int64 type change is something we don't support yet (in the schema evolution/normalization, when scanning a dataset), but it also shouldn't abort but raise a normal error about type mismatch.

      Attachments

        Issue Links

          Activity

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h
                  3h