[SPARK-33484] Kafka flatMapGroupsWithState not always ordered. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.0.1
Fix Version/s: None
Component/s: Structured Streaming
Labels: None
Environment: Scala 2.12, Spark 3.0.1, Hadoop 3.2.0, Spark SQL Kafka 0.10 3.0.1

Description

When using a Kafka source and `groupByKey` followed by `flatMapGroupsWithState`, it appears that if a single group contains 500k+ records in a batch, the Iterator supplied to the mapping function is not ordered by Kafka offset.

I know that, semantically, the records within a `groupByKey` group are not guaranteed to be ordered, but ordering seems extremely desirable in the case of Structured Streaming. Manually sorting this Iterator inside the mapping function consumes a lot of memory, and if ordering could be preserved more efficiently by reducing Kafka consumer parallelism, that would be preferable.
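For reference, the manual sort mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from the affected job: the `Event` case class and the `OrderedGroup.inKafkaOrder` helper are hypothetical names, and the sketch assumes the `partition` and `offset` columns of the Kafka source have been carried through to the grouped records. Kafka offsets are only comparable within a partition, so the sort key includes both.

```scala
// Hypothetical record type: `partition` and `offset` mirror the columns
// exposed by the Kafka source; `key` and `value` are assumed to be strings.
final case class Event(key: String, partition: Int, offset: Long, value: String)

object OrderedGroup {
  // Materializes the whole group in memory, then sorts by (partition, offset).
  // The memory cost grows with the group size, which is the overhead
  // described in the paragraph above.
  def inKafkaOrder(events: Iterator[Event]): Vector[Event] =
    events.toVector.sortBy(e => (e.partition, e.offset))
}
```

Inside the function passed to `flatMapGroupsWithState`, the supplied Iterator would be passed through `OrderedGroup.inKafkaOrder` before any order-dependent state update.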

If this is considered normal behavior, I believe the docs should state more clearly that ordering is not guaranteed within a batch, and I would appreciate some insight into how to avoid this behavior. I have been fiddling with `maxOffsetsPerTrigger`, but I do not know exactly what batch size triggers the issue, and setting it too low could prevent the job from keeping up with the topic.
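For context, `maxOffsetsPerTrigger` is set on the Kafka source roughly as in the sketch below. The broker address, topic name, and cap value are placeholders, not values from the affected job, and `CappedKafkaSource` is a hypothetical name. Whether any particular cap avoids the ordering problem is not known.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CappedKafkaSource {
  // Placeholder broker, topic, and cap: none of these come from the report.
  def read(spark: SparkSession): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      // Upper bound on the total number of offsets read per micro-batch,
      // split proportionally across the topic's partitions.
      .option("maxOffsetsPerTrigger", 100000L)
      .load()
      // Keep partition and offset so downstream code can check or restore order.
      .selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "partition",
        "offset"
      )
}
```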

People

Assignee: Unassigned
Reporter: Kevin Flansburg
Votes: 0
Watchers: 1

Dates

Created: 19/Nov/20 05:21
Updated: 19/Nov/20 05:21