Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-35611

Introduce the strategy on mismatched offset for start offset timestamp on Kafka data source

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.0.2, 3.1.1
    • 3.2.0
    • Structured Streaming
    • None

    Description

      1. Rationalization

      We encountered a real-world case Spark fails the query if some of the partitions don't have matching offset by timestamp.

      This is intended behavior to avoid bring unintended output for some cases like:

      • timestamp 2 is presented as timestamp-offset, but the some of partitions don't have the record yet
      • record with timestamp 1 comes "later" in the following micro-batch

      which is possible since Kafka allows to specify the timestamp in record.

      Here the unintended output we talked about was the risk of reading record with timestamp 1 in the next micro-batch despite the option specifying timestamp 2.

      But for many cases end users just suppose timestamp is increasing monotonically, and current behavior blocks these cases to make progress.

      2. Proposal

      For the cases the timestamp is supposed to increase monotonically, it's safe to consider the offset to be latest (technically, offset for latest record + 1) if there's no matching record via timestamp.

      This would be pretty much helpful for the case where there's a skew between partitions and some partitions have older records.

      • AS-IS: Spark simply fails the query and end users have to deal with workarounds requiring manual steps.
      • TO-BE: Spark will assign the latest offset for these partitions, so that Spark can read newer records from these partitions in further micro-batches.

      To retain the existing behavior and also give some help for the proposed "TO-BE" behavior, we'd like to introduce the strategy on mismatched offset for start offset timestamp.

      Attachments

        Activity

          People

            kabhwan Jungtaek Lim
            kabhwan Jungtaek Lim
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: