[SPARK-24697] Fix the reported start offsets in streaming query progress - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.1
Fix Version/s: 2.4.0
Component/s: Structured Streaming
Labels: None

Description

A streaming query reports progress during each trigger (e.g. after runBatch in MicroBatchExecution). However, the reported progress has the wrong offsets: the offsets are committed, and committedOffsets is updated to availableOffsets, before the progress is reported.

As a result, the reported startOffset and endOffset are always the same.
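
The ordering problem can be illustrated with a toy model. This is only a sketch in Python, not Spark code and not the change made in the linked pull requests: the function run_trigger and its parameters are hypothetical, a single partition is modeled as plain integers, and the starting offset 33 is an assumption derived from the sample (44 minus 11 rows, assuming one row per offset).

```python
def run_trigger(committed, available, report_before_commit):
    """Toy model of one micro-batch trigger for a single partition.

    `committed` and `available` stand in for committedOffsets and
    availableOffsets; all names here are hypothetical.
    """
    num_input_rows = available - committed  # rows processed in this batch

    if report_before_commit:
        # Snapshot the offsets while `committed` still marks the batch start.
        progress = {"startOffset": committed, "endOffset": available,
                    "numInputRows": num_input_rows}
        committed = available
    else:
        # Ordering described in the issue: commit first, then report.
        committed = available
        progress = {"startOffset": committed, "endOffset": available,
                    "numInputRows": num_input_rows}
    return progress, committed


# Assumed starting offset 33, so that 11 rows end at offset 44.
buggy, _ = run_trigger(committed=33, available=44, report_before_commit=False)
fixed, _ = run_trigger(committed=33, available=44, report_before_commit=True)
print(buggy)
print(fixed)
```

With the commit-then-report ordering, the start offset has already been overwritten by the time the progress is built, so both offsets read 44 even though 11 rows were processed.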

Sample output for a Kafka source is below. Here 11 rows are processed in the microbatch, yet the start and end offsets are the same.

{
  "id" : "76bf5515-55be-46af-bc79-9fc92cc6d856",
  "runId" : "b526f0f4-24bf-4ddc-b6e8-7b0cc83bdbe8",
  ...
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[topic2]]",
    "startOffset" : {
      "topic2" : {
        "0" : 44
      }
    },
    "endOffset" : {
      "topic2" : {
        "0" : 44
      }
    },
    "numInputRows" : 11,
    "inputRowsPerSecond" : 1.099670098970309,
    "processedRowsPerSecond" : 1.8829168093118795
  } ],
  ...
}
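
One way to spot this symptom in a progress report is to look for sources that read rows but report identical start and end offsets. The sketch below is a hypothetical helper (suspicious_sources is not part of any Spark API); it uses only the fields shown in the sample above, with the elided fields left out.

```python
import json


def suspicious_sources(progress):
    """Return descriptions of sources that read rows but report
    identical start and end offsets (hypothetical helper)."""
    return [
        s["description"]
        for s in progress.get("sources", [])
        if s.get("numInputRows", 0) > 0
        and s.get("startOffset") == s.get("endOffset")
    ]


# Only the "sources" portion of the sample; elided fields are omitted.
sample = json.loads("""
{
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[topic2]]",
    "startOffset" : { "topic2" : { "0" : 44 } },
    "endOffset" : { "topic2" : { "0" : 44 } },
    "numInputRows" : 11
  } ]
}
""")

flagged = suspicious_sources(sample)
print(flagged)
```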

Issue Links

links to

[Github] Pull Request #21673 (arunmahadevan)

[Github] Pull Request #21744 (tdas)

People

Assignee: Tathagata Das

Reporter: Arun Mahadevan

Votes: 0

Watchers: 3

Dates

Created: 29/Jun/18 21:17

Updated: 13/Jul/18 05:26

Resolved: 11/Jul/18 19:45