Apache Arrow / ARROW-12066

[Python] Dataset API seg fault when filtering string column for None

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 7.0.0
    • Component/s: Python
    • Environment: macOS 10.15.7

    Description

      Loading a Parquet file with the Dataset API causes a segmentation fault when a string column is filtered for None values.

      Minimal reproducing example: 

      import pyarrow as pa
      import pyarrow.dataset
      import pyarrow.parquet
      import pandas as pd
      
      path = "./test.parquet"
      df = pd.DataFrame({"A": ("a", "b", None)})
      pa.parquet.write_table(pa.table(df), path)
      
      ds = pa.dataset.dataset(path, format="parquet")
      filter = pa.dataset.field("A") == pa.dataset.scalar(None)
      table = ds.to_table(filter=filter)
      

      Backtrace:

      (lldb) target create "/usr/local/mambaforge/envs/xxx/bin/python"
      Current executable set to '/usr/local/mambaforge/envs/xxx/bin/python' (x86_64).
      (lldb) settings set -- target.run-args  "./tmp.py"
      (lldb) r
      Process 35235 launched: '/usr/local/mambaforge/envs/xxx/bin/python' (x86_64)
      Process 35235 stopped
      * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9)
          frame #0: 0x000000010314be48 libarrow.300.0.0.dylib`arrow::Status arrow::VisitScalarInline<arrow::ScalarHashImpl>(arrow::Scalar const&, arrow::ScalarHashImpl*) + 104
      libarrow.300.0.0.dylib`arrow::VisitScalarInline<arrow::ScalarHashImpl>:
      ->  0x10314be48 <+104>: cmpb   $0x0, 0x9(%rax)
          0x10314be4c <+108>: je     0x10314c0bc               ; <+732>
          0x10314be52 <+114>: movq   0x10(%rax), %rdi
          0x10314be56 <+118>: movq   0x20(%rax), %rsi
      Target 0: (python) stopped.
      (lldb) bt
      * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9)
        * frame #0: 0x000000010314be48 libarrow.300.0.0.dylib`arrow::Status arrow::VisitScalarInline<arrow::ScalarHashImpl>(arrow::Scalar const&, arrow::ScalarHashImpl*) + 104
          frame #1: 0x000000010314bd4f libarrow.300.0.0.dylib`arrow::ScalarHashImpl::AccumulateHashFrom(arrow::Scalar const&) + 111
          frame #2: 0x0000000103134bca libarrow.300.0.0.dylib`arrow::Scalar::Hash::hash(arrow::Scalar const&) + 42
          frame #3: 0x0000000132fa0ea8 libarrow_dataset.300.0.0.dylib`arrow::dataset::Expression::hash() const + 264
          frame #4: 0x0000000132fc913c libarrow_dataset.300.0.0.dylib`std::__1::__hash_const_iterator<std::__1::__hash_node<arrow::dataset::Expression, void*>*> std::__1::__hash_table<arrow::dataset::Expression, arrow::dataset::Expression::Hash, std::__1::equal_to<arrow::dataset::Expression>, std::__1::allocator<arrow::dataset::Expression> >::find<arrow::dataset::Expression>(arrow::dataset::Expression const&) const + 28
          frame #5: 0x0000000132faca9b libarrow_dataset.300.0.0.dylib`arrow::Result<arrow::dataset::Expression> arrow::dataset::Modify<arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_1, arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_9>(arrow::dataset::Expression, arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_1 const&, arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_9 const&) + 123
          frame #6: 0x0000000132fac623 libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*) + 131
          frame #7: 0x0000000132fac76d libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*) + 461
          frame #8: 0x0000000132fb00cb libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression, arrow::dataset::Expression const&)::$_10::operator()() const + 75
          frame #9: 0x0000000132faf6b5 libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression, arrow::dataset::Expression const&) + 517
          frame #10: 0x0000000132f893f8 libarrow_dataset.300.0.0.dylib`arrow::dataset::Dataset::GetFragments(arrow::dataset::Expression) + 88
          frame #11: 0x0000000132f8d25c libarrow_dataset.300.0.0.dylib`arrow::dataset::GetFragmentsFromDatasets(std::__1::vector<std::__1::shared_ptr<arrow::dataset::Dataset>, std::__1::allocator<std::__1::shared_ptr<arrow::dataset::Dataset> > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr<arrow::dataset::Dataset>)::operator()(std::__1::shared_ptr<arrow::dataset::Dataset>) const + 76
          frame #12: 0x0000000132f8cd6c libarrow_dataset.300.0.0.dylib`arrow::MapIterator<arrow::dataset::GetFragmentsFromDatasets(std::__1::vector<std::__1::shared_ptr<arrow::dataset::Dataset>, std::__1::allocator<std::__1::shared_ptr<arrow::dataset::Dataset> > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr<arrow::dataset::Dataset>), std::__1::shared_ptr<arrow::dataset::Dataset>, arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> > >::Next() + 316
          frame #13: 0x0000000132f8cb27 libarrow_dataset.300.0.0.dylib`arrow::Result<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> > > arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> > >::Next<arrow::MapIterator<arrow::dataset::GetFragmentsFromDatasets(std::__1::vector<std::__1::shared_ptr<arrow::dataset::Dataset>, std::__1::allocator<std::__1::shared_ptr<arrow::dataset::Dataset> > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr<arrow::dataset::Dataset>), std::__1::shared_ptr<arrow::dataset::Dataset>, arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> > > >(void*) + 39
          frame #14: 0x0000000132f8dcdb libarrow_dataset.300.0.0.dylib`arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> > >::Next() + 43
          frame #15: 0x0000000132f8d692 libarrow_dataset.300.0.0.dylib`arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::Fragment> >::Next() + 258
          frame #16: 0x0000000132f8d477 libarrow_dataset.300.0.0.dylib`arrow::Result<std::__1::shared_ptr<arrow::dataset::Fragment> > arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> >::Next<arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::Fragment> > >(void*) + 39
          frame #17: 0x0000000132f8de0b libarrow_dataset.300.0.0.dylib`arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> >::Next() + 43
          frame #18: 0x0000000132fffe80 libarrow_dataset.300.0.0.dylib`arrow::MapIterator<arrow::dataset::GetScanTaskIterator(arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> >, std::__1::shared_ptr<arrow::dataset::ScanOptions>, std::__1::shared_ptr<arrow::dataset::ScanContext>)::'lambda'(std::__1::shared_ptr<arrow::dataset::Fragment>), std::__1::shared_ptr<arrow::dataset::Fragment>, arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > >::Next() + 48
          frame #19: 0x0000000132fffd47 libarrow_dataset.300.0.0.dylib`arrow::Result<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > > arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > >::Next<arrow::MapIterator<arrow::dataset::GetScanTaskIterator(arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> >, std::__1::shared_ptr<arrow::dataset::ScanOptions>, std::__1::shared_ptr<arrow::dataset::ScanContext>)::'lambda'(std::__1::shared_ptr<arrow::dataset::Fragment>), std::__1::shared_ptr<arrow::dataset::Fragment>, arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > > >(void*) + 39
          frame #20: 0x0000000133003dcb libarrow_dataset.300.0.0.dylib`arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > >::Next() + 43
          frame #21: 0x0000000133003782 libarrow_dataset.300.0.0.dylib`arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::ScanTask> >::Next() + 258
          frame #22: 0x0000000133003567 libarrow_dataset.300.0.0.dylib`arrow::Result<std::__1::shared_ptr<arrow::dataset::ScanTask> > arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> >::Next<arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > >(void*) + 39
          frame #23: 0x0000000132fd479b libarrow_dataset.300.0.0.dylib`arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> >::Next() + 43
          frame #24: 0x0000000132fd44e8 libarrow_dataset.300.0.0.dylib`arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> >::RangeIterator::Next() + 88
          frame #25: 0x0000000132ffe43d libarrow_dataset.300.0.0.dylib`arrow::dataset::Scanner::ToTable() + 589
          frame #26: 0x0000000132f2963a _dataset.cpython-39-darwin.so`__pyx_pw_7pyarrow_8_dataset_7Scanner_13to_table(_object*, _object*) + 74
          frame #27: 0x0000000132ef47d4 _dataset.cpython-39-darwin.so`__Pyx_PyObject_CallNoArg(_object*) + 132
          frame #28: 0x0000000132ef0cc9 _dataset.cpython-39-darwin.so`__pyx_pw_7pyarrow_8_dataset_7Dataset_14to_table(_object*, _object*, _object*) + 569
          frame #29: 0x00000001000d5a04 python`cfunction_call + 52
          frame #30: 0x0000000100074998 python`_PyObject_MakeTpCall + 136
          frame #31: 0x00000001001aa8f3 python`call_function + 323
          frame #32: 0x00000001001a843f python`_PyEval_EvalFrameDefault + 45039
          frame #33: 0x000000010019bc04 python`_PyEval_EvalCode + 548
          frame #34: 0x000000010020ec51 python`pyrun_file + 321
          frame #35: 0x000000010020e49c python`pyrun_simple_file + 412
          frame #36: 0x000000010020e2ad python`PyRun_SimpleFileExFlags + 109
          frame #37: 0x0000000100239ed9 python`pymain_run_file + 329
          frame #38: 0x00000001002395c0 python`pymain_run_python + 992
          frame #39: 0x0000000100239185 python`Py_RunMain + 37
          frame #40: 0x000000010023a8f1 python`pymain_main + 49
          frame #41: 0x0000000100001b48 python`main + 56
          frame #42: 0x00007fff73ab2cc9 libdyld.dylib`start + 1
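
      Because the crash takes down the whole interpreter, it can be convenient to check a given pyarrow install by running the script in a child process and inspecting its exit status. The following is a minimal sketch assuming a POSIX system (as in the macOS environment of this report); the helper name crashed_with_memory_fault is hypothetical, and "./tmp.py" is the script name used in the lldb session above.

```python
import signal
import subprocess
import sys


def crashed_with_memory_fault(source):
    """Run `source` in a fresh Python interpreter.

    Returns True if the child was killed by SIGSEGV or SIGBUS (POSIX only).
    Hypothetical helper for illustration.
    """
    result = subprocess.run([sys.executable, "-c", source], capture_output=True)
    # On POSIX, a return code of -N means the child was terminated by signal N.
    return result.returncode in (-signal.SIGSEGV, -signal.SIGBUS)


# Example use with the script from the report (assumed saved as ./tmp.py):
#   with open("./tmp.py") as f:
#       print(crashed_with_memory_fault(f.read()))
```

      A False result only means the child did not die from one of these two signals; it may still have failed for another reason, such as a missing import.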
      

            People

              Assignee: Joris Van den Bossche (jorisvandenbossche)
              Reporter: Thomas Blauth (ThomasBlauthQC)
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h
