Spark / SPARK-24697

Fix the reported start offsets in streaming query progress


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      A streaming query reports progress during each trigger (e.g., after runBatch in MicroBatchExecution). However, the reported progress contains the wrong offsets, because the offsets are committed and committedOffsets is updated to availableOffsets before the progress is reported.

      As a result, the reported startOffset and endOffset are always identical, even when the batch processed rows.
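      The ordering problem can be illustrated with a small toy model. This is only a sketch, not Spark's code: the `ToyStream` class, its method names, and the previous offset 33 are hypothetical, and the second method shows one possible ordering that avoids the symptom rather than the fix that was actually applied.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    # Hypothetical stand-ins for committedOffsets and availableOffsets.
    committed: dict = field(default_factory=dict)
    available: dict = field(default_factory=dict)

    def report(self):
        # Progress takes its start offset from `committed` at report time.
        return {"startOffset": dict(self.committed),
                "endOffset": dict(self.available)}

    def trigger_commit_first(self, new_offsets):
        # Ordering described in this issue: commit, then report.
        self.available = dict(new_offsets)
        self.committed = dict(self.available)
        return self.report()

    def trigger_report_first(self, new_offsets):
        # Alternative ordering: report while `committed` still holds
        # the previous batch's end offset, then commit.
        self.available = dict(new_offsets)
        progress = self.report()
        self.committed = dict(self.available)
        return progress


# Assumed previous offset of 33 for partition "0" (illustrative only).
buggy = ToyStream(committed={"0": 33}).trigger_commit_first({"0": 44})
fixed = ToyStream(committed={"0": 33}).trigger_report_first({"0": 44})
print(buggy)  # startOffset equals endOffset
print(fixed)  # startOffset is the previous commit
```

      In the commit-first ordering the start offset has already been overwritten by the time the progress is built, which matches the symptom described above.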

      Sample output for a Kafka source is shown below. Here 11 rows are processed in the microbatch, yet the start and end offsets are the same.

       

      {
        "id" : "76bf5515-55be-46af-bc79-9fc92cc6d856",
        "runId" : "b526f0f4-24bf-4ddc-b6e8-7b0cc83bdbe8",
        ...
        "sources" : [ {
          "description" : "KafkaV2[Subscribe[topic2]]",
          "startOffset" : {
            "topic2" : {
              "0" : 44
            }
          },
          "endOffset" : {
            "topic2" : {
              "0" : 44
            }
          },
          "numInputRows" : 11,
          "inputRowsPerSecond" : 1.099670098970309,
          "processedRowsPerSecond" : 1.8829168093118795
        } ],
        ...
      }
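      A quick way to spot this symptom in a progress report is to look for sources that read rows while reporting identical start and end offsets. The helper below is a minimal sketch; `suspicious_sources` is a hypothetical name, and the input is a trimmed copy of the sample above containing only the fields the check needs.

```python
import json


def suspicious_sources(progress):
    """Return descriptions of sources that read rows but report start == end."""
    return [
        source.get("description")
        for source in progress.get("sources", [])
        if source.get("numInputRows", 0) > 0
        and source.get("startOffset") == source.get("endOffset")
    ]


# Trimmed copy of the sample progress from this issue.
sample = json.loads("""
{
  "sources": [{
    "description": "KafkaV2[Subscribe[topic2]]",
    "startOffset": {"topic2": {"0": 44}},
    "endOffset": {"topic2": {"0": 44}},
    "numInputRows": 11
  }]
}
""")

print(suspicious_sources(sample))  # ['KafkaV2[Subscribe[topic2]]']
```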
      

       

          People

            tdas Tathagata Das
            arunmahadevan Arun Mahadevan
