Spark / SPARK-33484

Kafka flatMapGroupsWithState not always ordered.

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None
    • Environment: Scala 2.12, Spark 3.0.1, Hadoop 3.2.0, Spark SQL Kafka 0.10 3.0.1

    Description

      When using a Kafka source with `groupByKey` followed by `flatMapGroupsWithState`, it appears that if a single group contains 500k+ records in a batch, the Iterator supplied to the mapping function is not ordered by Kafka offset.
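
      One way to confirm the behavior described above is to scan the Iterator for the first offset inversion without buffering the group. The following is a minimal sketch in plain Scala, not taken from the reporter's job: `KafkaRecord` and `OffsetOrderCheck` are hypothetical names, and it assumes the `partition` and `offset` columns of the Kafka source have been carried into the record type. Offsets are only comparable within a single partition, so the check tracks the last offset seen per partition.

```scala
// Hypothetical record shape: assumes the Kafka source's `partition` (Int)
// and `offset` (Long) columns were kept when building the typed Dataset.
final case class KafkaRecord(key: String, value: String, partition: Int, offset: Long)

object OffsetOrderCheck {
  /** Returns the first record whose offset is lower than one already seen
    * for the same partition, or None if every partition is in order.
    * Uses memory proportional to the number of partitions, not records. */
  def firstInversion(records: Iterator[KafkaRecord]): Option[KafkaRecord] = {
    val lastSeen = scala.collection.mutable.Map.empty[Int, Long]
    records.find { r =>
      val outOfOrder = lastSeen.get(r.partition).exists(_ > r.offset)
      if (!outOfOrder) lastSeen(r.partition) = r.offset
      outOfOrder
    }
  }
}
```

      Note that the check consumes the Iterator, so inside the function passed to `flatMapGroupsWithState` it would have to be folded into the normal per-record processing rather than run as a separate pass.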

      I know that, semantically, `groupByKey` does not guarantee the order of records within a group, but ordering seems extremely desirable in the case of Structured Streaming. Manually sorting this Iterator inside the mapping function consumes a lot of memory; if ordering could be preserved more efficiently by reducing Kafka consumer parallelism, that would be preferable.
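
      For reference, the manual sort mentioned above might look like the following minimal sketch. It is an illustration rather than the reporter's actual code: `KafkaRecord` and `OffsetSort` are hypothetical names, and the record type is assumed to carry the Kafka `partition` and `offset`. The whole group is materialized before sorting, which is exactly the memory cost at issue when a group holds hundreds of thousands of records.

```scala
// Hypothetical record shape: assumes the Kafka source's `partition` (Int)
// and `offset` (Long) columns were kept when building the typed Dataset.
final case class KafkaRecord(key: String, value: String, partition: Int, offset: Long)

object OffsetSort {
  /** Buffers the entire group in memory, then sorts by (partition, offset).
    * Memory use grows linearly with the number of records in the group. */
  def sortedByOffset(records: Iterator[KafkaRecord]): Iterator[KafkaRecord] =
    records.toVector.sortBy(r => (r.partition, r.offset)).iterator
}
```

      In a mapping function of shape `(K, Iterator[V], GroupState[S]) => Iterator[U]`, this would be applied to the incoming Iterator before any state update that depends on record order.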

      If this is considered normal behavior, I believe the docs should state more clearly that ordering is not guaranteed within a batch. I would also appreciate some insight into how to avoid this behavior. (I have been experimenting with `maxOffsetsPerTrigger`, but I do not know exactly what batch size triggers the issue, and setting it too low could prevent the job from keeping up with the topic.)
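
      For readers unfamiliar with `maxOffsetsPerTrigger`, it is an option on the Kafka source that caps how many offsets are read per trigger. A minimal sketch of building the source options follows; `KafkaSourceOptions`, the broker address, the topic name, and the cap value used in the comment are all placeholders, and no particular value is known to avoid the reordering.

```scala
object KafkaSourceOptions {
  /** Builds the option map for the Kafka source. All argument values are
    * supplied by the caller; nothing here is a recommended setting. */
  def build(bootstrapServers: String, topic: String, maxOffsetsPerTrigger: Long): Map[String, String] =
    Map(
      "kafka.bootstrap.servers" -> bootstrapServers,
      "subscribe"               -> topic,
      // Upper bound on offsets consumed per micro-batch across all partitions.
      "maxOffsetsPerTrigger"    -> maxOffsetsPerTrigger.toString
    )
}

// Intended use, assuming an existing SparkSession named `spark`:
//   spark.readStream.format("kafka")
//     .options(KafkaSourceOptions.build("localhost:9092", "events", 100000L))
//     .load()
```

      A lower cap yields smaller batches, at the cost of throughput, which is the trade-off raised above.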

          People

            Assignee: Unassigned
            Reporter: Kevin Flansburg (kflansburg)
            Votes: 0
            Watchers: 1
