Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
2.0.0
Description
This happens when executing a DataFusion query plan with hash aggregation where the data source is not ready on the first call by the Executor, and the async state machine is passed to a pending state
In the Stream implem of GroupedHashAggregateStream and HashAggregateStream, the state is set to self.finished = true on the first call to poll_next(). If the inner stream is Poll::Pending on the first call, this means that the next call resolves to Poll::Ready(None), thus finishing the stream instead of actually consuming the inner data.
I think that it does not happen with most current sources because they never trigger the Poll::Pending state. Parquet is implemented with a blocking call inside poll_next() (which is also problematic but an other issue), Memory yields directly, and CSV also always yields Poll::Ready
An analysis should be performed on all physical plans to check if the issue occurs in other places.
Attachments
Issue Links
- links to