SPARK-4964

Exactly-once + WAL-free Kafka Support in Spark Streaming

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: DStreams
    • Labels: None

      Description

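      There are two issues with the current Kafka support; as the issue title indicates, they concern exactly-once semantics and the reliance on the write-ahead log (WAL).

      We want to solve both of these problems in this JIRA. Please see the following design doc for the solution.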
      https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p

          Activity

          apachespark Apache Spark added a comment -

          User 'koeninger' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3798

          cody@koeninger.org Cody Koeninger added a comment -

          Usage example of the dstream for the transactional and idempotent cases:

          https://github.com/koeninger/kafka-exactly-once/tree/master

          cody@koeninger.org Cody Koeninger added a comment -

          Design doc at https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit?usp=sharing

          tdas Tathagata Das added a comment (edited) -

          I am renaming this JIRA to "Exactly-once + WAL-free Kafka Support in Spark Streaming" because there are two problems that we are trying to solve, both of which are solved by the associated PR. See the design doc for more details.

          Also, I updated the description to reflect the two issues, and added references to the design doc.

          tdas Tathagata Das added a comment -

          Dibyendu Bhattacharya, Saisai Shao, Hari Shreedharan, Cody Koeninger:
          Please take a look at the design doc and comment on it.
          Thank you very much!

          vincentye38 vincent ye added a comment -

          I have pretty much the same idea as the one described in Tathagata's design doc. I prototyped it on top of the tresata/spark-kafka project. Here is the code:
          https://github.com/vincentye38/spark-kafka/tree/InputDStream_updateStateByKey
          I use StateDStream to checkpoint the offsets, since the generatedRDDs member variable and the clearMetadata() method of DStream are not accessible from its subclasses.
          I have run it in my company's staging environment for a week. It can recover after a restart.

          apachespark Apache Spark added a comment -

          User 'tdas' has created a pull request for this issue:
          https://github.com/apache/spark/pull/4384

          apachespark Apache Spark added a comment -

          User 'koeninger' has created a pull request for this issue:
          https://github.com/apache/spark/pull/4511


            People

            • Assignee: cody@koeninger.org Cody Koeninger
            • Reporter: cody@koeninger.org Cody Koeninger
            • Votes: 0
            • Watchers: 20
