Apache Arrow / ARROW-10008

[Python] pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False


Details

    Description

      I apologise if this is a known issue; I looked both in this issue tracker and on GitHub and didn't find it.

      There seems to be a problem reading a dataset with predicate pushdown (filters) on columns with categorical data. The problem only occurs with `use_legacy_dataset=False`. (With `use_legacy_dataset=True` there is no crash, but the filter has no effect if the column isn't a partition key.)

      Reproducer:

      import shutil
      import sys, platform
      from pathlib import Path
      import pyarrow as pa
      import pyarrow.parquet as pq
      import pandas as pd
      
      # Settings
      CATEGORICAL_DTYPE = True
      USE_LEGACY_DATASET = False
      
      print('Platform:', platform.platform())
      print('Python version:', sys.version)
      print('Pandas version:', pd.__version__)
      print('pyarrow version:', pa.__version__)
      print('categorical enabled:', CATEGORICAL_DTYPE)
      print('use_legacy_dataset:', USE_LEGACY_DATASET)
      print()
      
      # Clean up test dataset if present
      path = Path('blah.parquet')
      if path.exists():
          shutil.rmtree(str(path))
      
      # Simple data
      d = dict(col1=['a', 'b'], col2=[1, 2])
      
      # Either categorical or not
      if CATEGORICAL_DTYPE:
          df = pd.DataFrame(data=d, dtype='category')
      else:
          df = pd.DataFrame(data=d)
      
      # Write dataset
      table = pa.Table.from_pandas(df)
      pq.write_to_dataset(table, str(path))
      
      # Load dataset
      table = pq.read_table(
          str(path),
          filters=[('col1', '=', 'a')],
          use_legacy_dataset=USE_LEGACY_DATASET,
      )
      df = table.to_pandas()
      print(df.dtypes)
      print(repr(df))
      
      

       Output:

      $ python categorical_predicate_pushdown.py 
      Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
      Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
      [GCC 7.3.0]
      Pandas version: 1.1.2
      pyarrow version: 1.0.1
      categorical enabled: True
      use_legacy_dataset: False
      
      /arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: dictionary<values=string, indices=int32, ordered=0> vs string
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset19ParquetFileFragment15FilterRowGroupsERKNS0_10ExpressionE+0x563)[0x7f5043311cb3]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset17ParquetFileFormat8ScanFileESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEEPNS0_12FileFragmentE+0x203)[0x7f50433168a3]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset12FileFragment4ScanESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEE+0x55)[0x7f5043329785]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZZN5arrow7dataset19GetScanTaskIteratorENS_8IteratorISt10shared_ptrINS0_8FragmentEEEES2_INS0_11ScanOptionsEES2_INS0_11ScanContextEEENKUlS4_E_clES4_+0x91)[0x7f50433485a1]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorINS0_ISt10shared_ptrINS_7dataset8ScanTaskEEEEE4NextINS_11MapIteratorIZNS2_19GetScanTaskIteratorENS0_IS1_INS2_8FragmentEEEES1_INS2_11ScanOptionsEES1_INS2_11ScanContextEEEUlSA_E_SA_S5_EEEENS_6ResultIS5_EEPv+0xde)[0x7f504334b55e]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow15FlattenIteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextEv+0x127)[0x7f50433616b7]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextINS_15FlattenIteratorIS4_EEEENS_6ResultIS4_EEPv+0x14)[0x7f5043361874]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset7Scanner7ToTableEv+0x611)[0x7f5043336691]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x3b150)[0x7f50435c9150]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2c0eb)[0x7f50435ba0eb]
      /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2d9ab)[0x7f50435bb9ab]
      python(PyCFunction_Call+0x56)[0x562843a6dce6]
      python(_PyObject_MakeTpCall+0x22f)[0x562843a2b5cf]
      python(_PyEval_EvalFrameDefault+0x11d7)[0x562843aaf727]
      python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
      python(+0x18bb80)[0x562843a79b80]
      python(+0x1001e3)[0x5628439ee1e3]
      python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
      python(_PyFunction_Vectorcall+0x1e3)[0x562843a797a3]
      python(+0x1001e3)[0x5628439ee1e3]
      python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
      python(PyEval_EvalCodeEx+0x44)[0x562843a795b4]
      python(PyEval_EvalCode+0x1c)[0x562843b07bdc]
      python(+0x219c84)[0x562843b07c84]
      python(+0x24be94)[0x562843b39e94]
      python(PyRun_FileExFlags+0xa1)[0x562843a0279a]
      python(PyRun_SimpleFileExFlags+0x3b4)[0x562843a02b7f]
      python(+0x115a44)[0x562843a03a44]
      python(Py_BytesMain+0x39)[0x562843b3c9b9]
      /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5058f2a0b3]
      python(+0x1dea83)[0x562843acca83]
      Aborted (core dumped)
      

      With `CATEGORICAL_DTYPE = False`, it works as expected:

      $ python categorical_predicate_pushdown.py 
      Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
      Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
      [GCC 7.3.0]
      Pandas version: 1.1.2
      pyarrow version: 1.0.1
      categorical enabled: False
      use_legacy_dataset: False
      
      col1    object
      col2     int64
      dtype: object
        col1  col2
      0    a     1
      
      

       

       


            People

              bkietz Ben Kietzman
              cjrh Caleb Hattingh

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 20m