Apache Arrow / ARROW-10577

[Rust][DataFusion] Hash Aggregator stream finishes unexpectedly after going to Pending state

Details

    Description

      This happens when executing a DataFusion query plan with hash aggregation where the data source is not ready on the first call by the executor, so the async state machine goes into a pending state.

      In the Stream implementations of GroupedHashAggregateStream and HashAggregateStream, self.finished is set to true on the first call to poll_next(). If the inner stream returns Poll::Pending on that first call, the next call resolves to Poll::Ready(None), finishing the stream instead of actually consuming the inner data.
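
      A minimal sketch of the failure mode, not the actual DataFusion code. Everything here is hypothetical and uses only std: the local `Stream` trait stands in for `futures::Stream`, `PendingFirst` is a made-up source that returns Poll::Pending on its first poll, and `SumStream` is a toy aggregator whose `buggy` flag switches between setting `finished` on entry (the pattern described above) and setting it only once the input is exhausted.

```rust
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Stand-in for `futures::Stream` so the sketch depends only on std.
trait Stream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Hypothetical source: not ready on the first poll, then yields its items.
struct PendingFirst {
    polled: bool,
    items: Vec<i64>,
}

impl Stream for PendingFirst {
    type Item = i64;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<i64>> {
        let this = self.get_mut();
        if !this.polled {
            this.polled = true;
            // Ask to be polled again, as a well-behaved source would.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        Poll::Ready(this.items.pop())
    }
}

/// Toy aggregator: emits a single sum once the input is exhausted.
struct SumStream<S> {
    input: S,
    acc: i64,
    finished: bool,
    buggy: bool,
}

impl<S: Stream<Item = i64> + Unpin> Stream for SumStream<S> {
    type Item = i64;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<i64>> {
        let this = self.get_mut();
        if this.finished {
            return Poll::Ready(None);
        }
        if this.buggy {
            // Pattern from the issue: flag set before any input is consumed.
            this.finished = true;
        }
        loop {
            let polled = Pin::new(&mut this.input).poll_next(cx);
            match polled {
                // In the buggy variant `finished` is already true here,
                // so the next call returns Ready(None) and the data is lost.
                Poll::Pending => return Poll::Pending,
                Poll::Ready(Some(v)) => this.acc += v,
                Poll::Ready(None) => {
                    // Fixed variant: mark finished only after the input ended.
                    this.finished = true;
                    return Poll::Ready(Some(this.acc));
                }
            }
        }
    }
}

/// Waker that does nothing; enough for a busy-polling driver.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls a stream to completion, re-polling immediately on Pending.
fn drain<S: Stream + Unpin>(mut s: S) -> Vec<S::Item> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut out = Vec::new();
    loop {
        match Pin::new(&mut s).poll_next(&mut cx) {
            Poll::Pending => continue,
            Poll::Ready(Some(v)) => out.push(v),
            Poll::Ready(None) => return out,
        }
    }
}

/// Runs the toy aggregation over [1, 2, 3] with or without the bug.
fn run(buggy: bool) -> Vec<i64> {
    let input = PendingFirst { polled: false, items: vec![1, 2, 3] };
    drain(SumStream { input, acc: 0, finished: false, buggy })
}
```

      With this sketch, `run(true)` returns an empty Vec because the stream ends right after the first Pending, while `run(false)` returns `[6]`. Deferring the `finished = true` assignment until the input has actually returned Poll::Ready(None) is one possible shape for a fix; the real streams may need a different approach.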

      I think this does not happen with most current sources because they never return Poll::Pending. Parquet is implemented with a blocking call inside poll_next() (which is also problematic, but a separate issue), Memory yields directly, and CSV also always yields Poll::Ready.

      An analysis should be performed on all physical plans to check if the issue occurs in other places.

            People

              rdettai Rémi Dettai

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m