We encountered a real-world case where Spark fails the query if some of the partitions don't have a matching offset for the given timestamp.
This is intended behavior, meant to avoid producing unintended output in cases like:
- timestamp 2 is specified as the timestamp offset, but some partitions don't have a record with that timestamp yet
- a record with timestamp 1 arrives "later", in a following micro-batch
which is possible since Kafka allows producers to specify the timestamp on a record.
Here the unintended output we talked about is the risk of reading the record with timestamp 1 in the next micro-batch even though the option specifies timestamp 2.
But in many cases end users simply assume the timestamp increases monotonically, and the current behavior blocks these cases from making progress.
When the timestamp can be assumed to increase monotonically, it's safe to treat the offset for a partition with no record matching the timestamp as the latest offset (technically, the offset of the latest record + 1).
This is particularly helpful when there's skew between partitions and some partitions only have older records.
- AS-IS: Spark simply fails the query, and end users have to deal with workarounds requiring manual steps.
- TO-BE: Spark assigns the latest offset to these partitions, so that it can read newer records from them in subsequent micro-batches.
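The TO-BE resolution can be sketched in plain Python (helper and parameter names here are hypothetical; Spark's Kafka source actually resolves timestamps to offsets via the Kafka consumer's `offsetsForTimes`, which returns null for partitions with no matching record):

```python
def resolve_start_offsets(offsets_for_times, end_offsets, strategy="error"):
    """Resolve per-partition start offsets for a timestamp-based start.

    offsets_for_times: partition -> offset matching the timestamp, or None
                       when no record at/after the timestamp exists
                       (mirrors Kafka's offsetsForTimes returning null).
    end_offsets: partition -> offset of the latest record + 1.
    strategy: "error" keeps the current fail-fast behavior;
              "latest" falls back to the end offset for unmatched partitions.
    """
    resolved = {}
    for partition, offset in offsets_for_times.items():
        if offset is not None:
            resolved[partition] = offset
        elif strategy == "latest":
            # No record matched the timestamp: start at the end of the
            # partition so newer records are read in later micro-batches.
            resolved[partition] = end_offsets[partition]
        else:
            raise ValueError(
                f"No offset matched the timestamp for partition {partition}"
            )
    return resolved
```

With `strategy="latest"`, a skewed partition that has only older records starts at its end offset instead of failing the query; with the default `"error"`, behavior is unchanged.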
To retain the existing behavior while also enabling the proposed "TO-BE" behavior, we'd like to introduce a strategy option for handling mismatched offsets on the start timestamp.
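End-user usage might look like the PySpark sketch below. The option name `startingOffsetsByTimestampStrategy` and its values (`error`, `latest`) are what this proposal introduces and could differ in the final implementation; `startingOffsetsByTimestamp` is the existing Kafka source option.

```python
# Hypothetical usage sketch for the proposed strategy option; the name
# "startingOffsetsByTimestampStrategy" and its values are part of the
# proposal, not yet a settled API.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("subscribe", "topic1")
    # Start from records at/after these per-partition timestamps (ms).
    .option("startingOffsetsByTimestamp",
            """{"topic1": {"0": 1000, "1": 1000}}""")
    # "error": fail when a partition has no matching offset (current behavior)
    # "latest": fall back to the partition's latest offset (proposed)
    .option("startingOffsetsByTimestampStrategy", "latest")
    .load()
)
```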