
SPARK-24720: Kafka transactions create non-consecutive offsets (due to the transaction marker offset), making streaming fail when failOnDataLoss=true

Details

• Type: Bug
• Status: In Progress
• Priority: Major
• Resolution: Unresolved
• Affects Version/s: 2.3.1
• Fix Version/s: None
• Component/s: DStreams
• Labels: None

Description

When Kafka transactions are used, sending 1 message to Kafka results in 1 offset for the data + 1 offset to mark the transaction commit.

When the Kafka connector for Spark streaming reads a topic with non-consecutive offsets, it fails. SPARK-17147 fixed this issue for compacted topics.
However, SPARK-17147 doesn't fix it for Kafka transactions: if 1 message + 1 transaction commit are in a partition, Spark will try to read the offset range [0, 2). Offset 0 (containing the message) will be read, but offset 1 won't return a value, and buffer.hasNext() will be false even after a poll, since no data is present at offset 1 (it holds the transaction commit marker).
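A minimal sketch of how such a gap is produced, using the plain kafka-clients producer API (the broker address, topic name, and transactional.id are hypothetical). After one transactional send plus a commit, the partition's log end offset is 2 but only offset 0 holds a readable record; offset 1 is the commit marker:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TransactionalGapDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")       // hypothetical broker
    props.put("transactional.id", "demo-tx-producer")      // enables transactions
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    producer.beginTransaction()
    // The record itself occupies offset 0 of the partition.
    producer.send(new ProducerRecord("demo-topic", "key", "value"))
    // Committing writes a transaction marker (a control record) at offset 1,
    // so the next fetch position is 2 even though only offset 0 is readable.
    producer.commitTransaction()
    producer.close()
  }
}
```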

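And a sketch of the reading side that hits the failure, using the Structured Streaming Kafka source against the hypothetical topic above. With the default failOnDataLoss=true, the consumer interprets the gap at offset 1 as lost data and the query fails; setting it to false is the usual workaround, at the cost of also silently skipping offsets that are genuinely missing:

```scala
import org.apache.spark.sql.SparkSession

object ReadTransactionalTopic {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-gap-repro").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "demo-topic")                   // topic from the producer sketch
      .option("startingOffsets", "earliest")
      // failOnDataLoss=true (the default) treats the missing offset 1 as data
      // loss and aborts the query; false lets it proceed past the marker.
      .option("failOnDataLoss", "false")
      .load()

    df.writeStream.format("console").start().awaitTermination()
  }
}
```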

Attachments

Activity

People

• Assignee: Unassigned
• Reporter: Quentin Ambard (qambard)
• Votes: 0
• Watchers: 4

Dates

• Created:
• Updated: