Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.17.1, 1.0.1
Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug 5 2020, 08:36:46)
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
Description
I apologise if this is a known issue; I searched both this issue tracker and GitHub and didn't find it.
There seems to be a problem reading a dataset with predicate pushdown (filters) on columns with categorical data. The problem only occurs with `use_legacy_dataset=False` (with `True`, the filter has no effect if the column isn't a partition key).
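For context, predicate pushdown on Parquet generally means using per-row-group statistics (such as min/max) to skip row groups that cannot match the filter before any rows are decoded. Below is a minimal pure-Python sketch of that idea. `RowGroup`, `row_group_may_match` and `read_equal` are hypothetical names used only for illustration; they are not pyarrow APIs and do not reflect how Arrow implements this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with column statistics."""
    min_value: str
    max_value: str
    rows: List[str]


def row_group_may_match(group: RowGroup, value: str) -> bool:
    # An equality filter can only match if the value lies within [min, max].
    return group.min_value <= value <= group.max_value


def read_equal(groups: List[RowGroup], value: str) -> List[str]:
    matches = []
    for group in groups:
        if not row_group_may_match(group, value):
            continue  # the whole row group is skipped without reading its rows
        matches.extend(row for row in group.rows if row == value)
    return matches


groups = [
    RowGroup("a", "b", ["a", "b", "a"]),
    RowGroup("x", "z", ["x", "y", "z"]),
]
print(read_equal(groups, "a"))  # ['a', 'a']
```

The statistics check compares the filter literal against stored column values, which is why the literal and the column need comparable types.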
Reproducer:
import shutil
import sys, platform
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Settings
CATEGORICAL_DTYPE = True
USE_LEGACY_DATASET = False

print('Platform:', platform.platform())
print('Python version:', sys.version)
print('Pandas version:', pd.__version__)
print('pyarrow version:', pa.__version__)
print('categorical enabled:', CATEGORICAL_DTYPE)
print('use_legacy_dataset:', USE_LEGACY_DATASET)
print()

# Clean up test dataset if present
path = Path('blah.parquet')
if path.exists():
    shutil.rmtree(str(path))

# Simple data
d = dict(col1=['a', 'b'], col2=[1, 2])

# Either categorical or not
if CATEGORICAL_DTYPE:
    df = pd.DataFrame(data=d, dtype='category')
else:
    df = pd.DataFrame(data=d)

# Write dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, str(path))

# Load dataset
table = pq.read_table(
    str(path),
    filters=[('col1', '=', 'a')],
    use_legacy_dataset=USE_LEGACY_DATASET,
)
df = table.to_pandas()
print(df.dtypes)
print(repr(df))
Output:
$ python categorical_predicate_pushdown.py
Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug 5 2020, 08:36:46)
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
categorical enabled: True
use_legacy_dataset: False

/arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: dictionary<values=string, indices=int32, ordered=0> vs string
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset19ParquetFileFragment15FilterRowGroupsERKNS0_10ExpressionE+0x563)[0x7f5043311cb3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset17ParquetFileFormat8ScanFileESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEEPNS0_12FileFragmentE+0x203)[0x7f50433168a3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset12FileFragment4ScanESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEE+0x55)[0x7f5043329785]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZZN5arrow7dataset19GetScanTaskIteratorENS_8IteratorISt10shared_ptrINS0_8FragmentEEEES2_INS0_11ScanOptionsEES2_INS0_11ScanContextEEENKUlS4_E_clES4_+0x91)[0x7f50433485a1]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorINS0_ISt10shared_ptrINS_7dataset8ScanTaskEEEEE4NextINS_11MapIteratorIZNS2_19GetScanTaskIteratorENS0_IS1_INS2_8FragmentEEEES1_INS2_11ScanOptionsEES1_INS2_11ScanContextEEEUlSA_E_SA_S5_EEEENS_6ResultIS5_EEPv+0xde)[0x7f504334b55e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow15FlattenIteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextEv+0x127)[0x7f50433616b7]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextINS_15FlattenIteratorIS4_EEEENS_6ResultIS4_EEPv+0x14)[0x7f5043361874]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset7Scanner7ToTableEv+0x611)[0x7f5043336691]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x3b150)[0x7f50435c9150]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2c0eb)[0x7f50435ba0eb]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2d9ab)[0x7f50435bb9ab]
python(PyCFunction_Call+0x56)[0x562843a6dce6]
python(_PyObject_MakeTpCall+0x22f)[0x562843a2b5cf]
python(_PyEval_EvalFrameDefault+0x11d7)[0x562843aaf727]
python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
python(+0x18bb80)[0x562843a79b80]
python(+0x1001e3)[0x5628439ee1e3]
python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
python(_PyFunction_Vectorcall+0x1e3)[0x562843a797a3]
python(+0x1001e3)[0x5628439ee1e3]
python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
python(PyEval_EvalCodeEx+0x44)[0x562843a795b4]
python(PyEval_EvalCode+0x1c)[0x562843b07bdc]
python(+0x219c84)[0x562843b07c84]
python(+0x24be94)[0x562843b39e94]
python(PyRun_FileExFlags+0xa1)[0x562843a0279a]
python(PyRun_SimpleFileExFlags+0x3b4)[0x562843a02b7f]
python(+0x115a44)[0x562843a03a44]
python(Py_BytesMain+0x39)[0x562843b3c9b9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5058f2a0b3]
python(+0x1dea83)[0x562843acca83]
Aborted (core dumped)
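The error message suggests that the filter literal (a plain string) was compared against a dictionary-encoded column without any cast between the two types. As a rough pure-Python illustration of how the two representations differ, and of one way a comparison can be made by decoding first, consider the sketch below. The `DictionaryColumn` class and `equals_mask` function are hypothetical; this is not how Arrow represents the data internally, nor a description of the eventual fix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DictionaryColumn:
    """Hypothetical model of a dictionary<values=string, indices=int32> column."""
    dictionary: List[str]  # the unique values
    indices: List[int]     # one index into `dictionary` per row

    def decode(self) -> List[str]:
        # Expand the indices back into plain string values.
        return [self.dictionary[i] for i in self.indices]


def equals_mask(column: DictionaryColumn, literal: str) -> List[bool]:
    # Decode to plain strings first, so both sides of the comparison
    # have the same type.
    return [value == literal for value in column.decode()]


col1 = DictionaryColumn(dictionary=["a", "b"], indices=[0, 1])
print(equals_mask(col1, "a"))  # [True, False]
```

An alternative design would look up the literal in the dictionary once and compare integer indices, which avoids decoding every row.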
With `CATEGORICAL_DTYPE = False`, it works as expected:
$ python categorical_predicate_pushdown.py
Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug 5 2020, 08:36:46)
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
categorical enabled: False
use_legacy_dataset: False

col1    object
col2     int64
dtype: object
  col1  col2
0    a     1