A structured streaming query with a streaming aggregation can throw the following error in rare cases.
This can happen when all of the following conditions are met.
- The streaming aggregation uses an aggregation function that is a subclass of TypedImperativeAggregate (for example, collect_set, collect_list, or percentile).
- The query runs in update mode.
- After the shuffle, a partition has exactly 128 records.
This happens because of the following.
- The StateStoreSaveExec used in streaming aggregations has the following logic when used in update mode.
  - There is an iterator that reads data from its parent iterator and updates the StateStore.
  - When the parent iterator is fully consumed (i.e. baseIterator.hasNext returns false), all state changes are committed by calling StateStore.commit.
  - The implementation of StateStore.commit() in HDFSBackedStateStore does not allow itself to be called twice. However, the iterator's logic is such that every call to hasNext made after baseIterator.hasNext has returned false calls StateStore.commit again.
- For most aggregation functions, this is okay because hasNext is only called once. But that's not the case with TypedImperativeAggregate functions.
  - TypedImperativeAggregate functions are executed using ObjectHashAggregateExec, which will try to use two kinds of hashmaps for aggregations.
  - It will first try to use an unsorted hashmap. If the size of the hashmap increases beyond a certain threshold (default 128), then it will switch to using a sorted hashmap.
  - The switching logic in ObjectAggregationIterator (used by ObjectHashAggregateExec) is such that when the number of records matches the threshold (i.e. 128), it will end up calling iterator.hasNext twice.
When these two behaviors are combined, StateStore.commit is called twice, which leads to the above error. This latent bug has existed since Spark 2.1, when ObjectHashAggregateExec was introduced.
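
The interaction described above can be reproduced in miniature without Spark. The following is a simplified sketch, not Spark's actual code: OneShotStore, SavingIterator, aggregate and run are hypothetical names invented for illustration, and the fallback logic is only loosely modeled on the description of ObjectAggregationIterator given here. It assumes every record has a distinct key, so the hashmap size equals the record count.

```python
class OneShotStore:
    """Hypothetical stand-in for a state store whose commit() may run only once."""

    def __init__(self):
        self.data = {}
        self.committed = False

    def put(self, key, value):
        self.data[key] = value

    def commit(self):
        if self.committed:
            raise RuntimeError("commit() called twice")
        self.committed = True


class SavingIterator:
    """Java-style iterator that writes each row to the store and commits
    once the parent rows are exhausted (hypothetical model)."""

    def __init__(self, rows, store):
        self._rows = list(rows)
        self._pos = 0
        self._store = store

    def has_next(self):
        if self._pos < len(self._rows):
            return True
        # The flaw being illustrated: there is no guard here, so every call
        # made after the parent is exhausted commits again.
        self._store.commit()
        return False

    def next(self):
        key, value = self._rows[self._pos]
        self._pos += 1
        self._store.put(key, value)
        return key, value


def aggregate(rows_iter, threshold=128):
    """Consume rows into a hashmap, switching phases at the threshold."""
    hash_map = {}
    sort_based = False
    # has_next() is evaluated before the flag is checked.
    while rows_iter.has_next() and not sort_based:
        key, value = rows_iter.next()
        hash_map.setdefault(key, []).append(value)
        if len(hash_map) >= threshold:
            sort_based = True
    if sort_based:
        # Second phase: drain whatever is left (the sorted map is not modeled).
        while rows_iter.has_next():
            key, value = rows_iter.next()
            hash_map.setdefault(key, []).append(value)
    return hash_map


def run(n):
    """Aggregate n records with distinct keys and report the outcome."""
    store = OneShotStore()
    rows = [(f"k{i}", i) for i in range(n)]
    try:
        aggregate(SavingIterator(rows, store))
        return "ok"
    except RuntimeError as e:
        return f"error: {e}"


for n in (127, 128, 129):
    print(n, run(n))
# 127 ok
# 128 error: commit() called twice
# 129 ok
```

Only the case where the record count equals the threshold fails in this sketch: the first loop exits on an exhausted has_next() with the fallback flag already set, and the second loop then calls has_next() again on the exhausted iterator.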