  1. Apache Arrow
  2. ARROW-13982

[C++] Async scanner stalls if a fragment generates no batches

Details

    Description

      Reading Parquet files with the dataset scanner may stall because of a future that never finishes.

      To reproduce this, create two Parquet files and set a filter expression that filters out one of the files completely. Then call `AsyncScanner::ToRecordBatchReader` and keep reading from the resulting reader.

      I have also dug into this bug a little. It is caused by the `MakeEmptyGenerator<std::shared_ptr<RecordBatch>>` that is returned when the set of filtered row groups is empty. That empty generator is ignored by `FragmentToBatches`, which causes the SequencingGenerator to stall.

      A quick fix is to return a record batch with 0 rows instead of returning a nullptr there.
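To illustrate why a fragment with no batches can block a sequencing step, and why a 0-row batch unblocks it, here is a toy model in plain Python. It is not Arrow code. `Tagged`, `tag_fragments`, and `sequence` are hypothetical names, and the model assumes that each batch is tagged with its fragment index, its index within the fragment, and a "last batch" flag, and that the sequencer only advances to the next fragment after seeing a batch flagged as last.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    fragment: int  # index of the fragment the batch came from
    index: int     # index of the batch within its fragment
    last: bool     # True for the final batch of the fragment
    rows: int      # row count, standing in for the batch itself


def tag_fragments(fragments, pad_empty):
    """Tag each batch with its position; optionally emit one 0-row batch for an empty fragment."""
    tagged = []
    for f, batches in enumerate(fragments):
        if not batches and pad_empty:
            batches = [0]  # the proposed fix: a single batch with 0 rows
        for i, rows in enumerate(batches):
            tagged.append(Tagged(f, i, i == len(batches) - 1, rows))
    return tagged


def sequence(tagged):
    """Deliver batches in (fragment, index) order; report whether any were left waiting."""
    pending = {(t.fragment, t.index): t for t in tagged}
    delivered = []
    expected = (0, 0)
    while expected in pending:
        t = pending.pop(expected)
        delivered.append(t)
        # Only a batch flagged as last lets the sequencer move on to the next fragment.
        expected = (t.fragment + 1, 0) if t.last else (t.fragment, t.index + 1)
    # Anything still pending is waiting on a batch that never arrives: a stall.
    return delivered, bool(pending)


# The middle fragment is filtered out completely.
fragments = [[4], [], [3, 2]]
print(sequence(tag_fragments(fragments, pad_empty=False)))
print(sequence(tag_fragments(fragments, pad_empty=True)))
```

In this model the unpadded run delivers only the first fragment's batch and leaves the rest waiting, while the padded run delivers everything, including the 0-row placeholder.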

      Attachments

        1. repro.py
          2 kB
          David Li

            People

              lidavidm David Li
              framlog Huxley Hu
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h