SPARK-23685: Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Information Provided
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels:
      None

      Description

      When Kafka performs log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaSourceRDD and CachedKafkaConsumer assumes that the next offset is always an increment of 1. When it is not, it throws the exception below:

       

      "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:XXXX). Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you don't want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "false". "

       

      FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
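      For illustration, here is a minimal standalone sketch (not the Spark source itself) of the strict offset check that trips on a compacted topic. The broker address, group id, topic name, and the offset range 5589–5693 (taken from the error message above) are hypothetical placeholders.

      {code:scala}
      import java.util.{Collections, Properties}

      import scala.collection.JavaConverters._

      import org.apache.kafka.clients.consumer.KafkaConsumer
      import org.apache.kafka.common.TopicPartition
      import org.apache.kafka.common.serialization.ByteArrayDeserializer

      object CompactedOffsetSketch {
        def main(args: Array[String]): Unit = {
          val props = new Properties()
          props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker
          props.put("group.id", "sketch-group")               // hypothetical group id
          props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
          props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

          val tp = new TopicPartition("compacted-topic", 0)   // hypothetical compacted topic
          val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
          consumer.assign(Collections.singletonList(tp))
          consumer.seek(tp, 5589L)                            // offsets from the error message above

          var expected = 5589L
          for (record <- consumer.poll(5000L).records(tp).asScala
               if record.offset < 5693L) {
            // Strict assumption: the next record must sit at exactly `expected`.
            // On a compacted topic, a gap left by compaction trips this check
            // even though no data was actually lost.
            if (record.offset != expected) {
              throw new IllegalStateException(
                s"Cannot fetch records in [$expected, 5693): got offset ${record.offset}")
            }
            expected += 1
            // A gap-tolerant reader would instead advance with
            // expected = record.offset + 1 and keep consuming past the gap.
          }
          consumer.close()
        }
      }
      {code}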

       

       


            People

            • Assignee:
              Unassigned
            • Reporter:
              sindiri sirisha
