Description
SPARK-17147 fixes a problem with non-consecutive Kafka offsets. The PR was merged to master. This should be backported to branch-2.3.
Original description from SPARK-17147:
When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset + 1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be just an increment of 1 above the previous offset.
I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from:
nextOffset = offset + 1
to:
nextOffset = record.offset + 1
and changing KafkaRDD from:
requestOffset += 1
to:
requestOffset = r.offset() + 1
(I also had to change some assert logic in CachedKafkaConsumer).
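The effect of the two changes above can be sketched in isolation. The following is a simplified model, not the actual Spark patch: it treats the offsets returned from a compacted log as plain longs and shows why the consumer's next position must be derived from the record itself (record.offset() + 1) rather than from a running counter (offset + 1). The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NextOffsetDemo {

    // Given the offsets actually returned by the broker, compute the
    // position the consumer should request after consuming each record.
    // The fix: nextOffset = record.offset + 1, taken from the record,
    // not from an incrementing counter.
    public static List<Long> nextOffsets(long[] recordOffsets) {
        List<Long> next = new ArrayList<>();
        for (long recordOffset : recordOffsets) {
            next.add(recordOffset + 1); // nextOffset = record.offset + 1
        }
        return next;
    }

    public static void main(String[] args) {
        // A compacted topic may retain only offsets 0, 1, 3, 5
        // (offsets 2 and 4 were compacted away).
        long[] compacted = {0, 1, 3, 5};
        System.out.println(nextOffsets(compacted)); // prints [1, 2, 4, 6]
        // A naive `requestOffset += 1` counter would expect offset 2 after
        // consuming offset 1, and trip an assertion when the broker
        // returns offset 3 instead.
    }
}
```

After consuming offset 1 the consumer asks for offset 2; the broker answers with the next retained record (offset 3), and advancing via the record's own offset keeps the two in agreement.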
There's a strong possibility that I have misconstrued how to use the streaming kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction) I am also happy to contribute a PR.
Issue Links
- is a clone of SPARK-17147 Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction) (Resolved)
- is related to SPARK-23685 Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction) (Resolved)