Apache Arrow / ARROW-13480

[C++] [R] [Python] Dataset SyncScanner may freeze on error



      Description

      Working on the integration with DuckDB, we ran into an issue where errors do not appear to be propagated fully/correctly by record batch readers that cross the C data interface. The DuckDB issue where this came up is https://github.com/duckdb/duckdb/issues/2055

      In the example I'm passing a dataset with either one or two files from R to Python. I've deliberately mis-specified the schema to trigger an error. The one-file version works as I expect, percolating the error up:

      > library("arrow")
      > 
      > venv <- try(reticulate::virtualenv_create("arrow-test"))
      virtualenv: arrow-test
      > install_pyarrow("arrow-test", nightly = TRUE)
      [output from installing pyarrow ...]
      > reticulate::use_virtualenv("arrow-test")
      > 
      > file <- "arrow/r/inst/v0.7.1.parquet"
      > arrow_table <- arrow::open_dataset(rep(file, 1), schema(x=arrow::null()))
      > 
      > scan <- Scanner$create(arrow_table)
      > reader <- scan$ToRecordBatchReader()
      > pyreader <- reticulate::r_to_py(reader)
      > pytab <- pyreader$read_all()
      Error in py_call_impl(callable, dots$args, dots$keywords) : 
        OSError: NotImplemented: Unsupported cast from double to null using function cast_null
      
      Detailed traceback:
        File "pyarrow/ipc.pxi", line 563, in pyarrow.lib.RecordBatchReader.read_all
        File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
      

      But with 2 (or more) files, the process hangs while reading the batches:

      > library("arrow")
      > 
      > venv <- try(reticulate::virtualenv_create("arrow-test"))
      virtualenv: arrow-test
      > install_pyarrow("arrow-test", nightly = TRUE)
      [output from installing pyarrow ...]
      > reticulate::use_virtualenv("arrow-test")
      > 
      > file <- "arrow/r/inst/v0.7.1.parquet"
      > arrow_table <- arrow::open_dataset(rep(file, 2), schema(x=arrow::null()))
      > 
      > scan <- Scanner$create(arrow_table)
      > reader <- scan$ToRecordBatchReader()
      > pyreader <- reticulate::r_to_py(reader)
      > pytab <- pyreader$read_all()
      {hangs forever here}
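To make the suspected failure mode concrete, here is a hypothetical, stdlib-only sketch (not Arrow's actual implementation — the producer/consumer names and queue mechanism are my assumptions for illustration): a synchronous reader consumes batches that a background thread delivers over a blocking queue. If the producing thread dies on an error without delivering anything, the consumer blocks on queue.get() forever — the same symptom as the hang above. Pushing the exception itself through the queue lets the reader wake up and re-raise, as in the one-file case.

```python
import queue
import threading

_EOS = object()  # end-of-stream sentinel


def naive_producer(out_q, batches, fail_at):
    """Produces batches, but loses errors: models the suspected bug."""
    try:
        for i, batch in enumerate(batches):
            if i == fail_at:
                raise ValueError("Unsupported cast from double to null")
            out_q.put(batch)
        out_q.put(_EOS)
    except ValueError:
        pass  # BUG: neither a sentinel nor the error reaches the reader


def fixed_producer(out_q, batches, fail_at):
    """Same, but always delivers a terminal item on the error path."""
    try:
        for i, batch in enumerate(batches):
            if i == fail_at:
                raise ValueError("Unsupported cast from double to null")
            out_q.put(batch)
        out_q.put(_EOS)
    except Exception as exc:
        out_q.put(exc)  # FIX: push the error so the reader can re-raise it


def read_all(producer, batches, fail_at=None, timeout=0.2):
    """Consume batches from a background producer thread."""
    out_q = queue.Queue()
    threading.Thread(target=producer, args=(out_q, batches, fail_at),
                     daemon=True).start()
    result = []
    while True:
        try:
            item = out_q.get(timeout=timeout)  # the real reader blocks forever
        except queue.Empty:
            return "hung"  # stand-in for the deadlock, so this demo terminates
        if item is _EOS:
            return result
        if isinstance(item, Exception):
            raise item
        result.append(item)
```

With naive_producer and an error mid-stream, read_all times out ("hung" here, an infinite block in the real reader); with fixed_producer the same error surfaces as a raised ValueError, matching the one-file behavior above.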
      

      People

        Assignee: Unassigned
        Reporter: Jonathan Keane (jonkeane)


      Time Tracking

        Estimated: Not Specified
        Remaining: 0h
        Logged: 1h 10m