
SPARK-18905: Potential Issue of Semantics of BatchCompleted


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1, 1.5.2, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2
    • Fix Version/s: 2.1.1, 2.2.0
    • Component/s: DStreams
    • Labels: None

    Description

      The current implementation of Spark Streaming considers a batch completed regardless of whether its jobs succeeded or failed (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).
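      To make the ordering concrete, here is a minimal, self-contained sketch of that pattern (hypothetical class and method names, not the actual Spark source): the JobSet is dropped from the pending map as soon as all of its jobs have finished running, before any job result is examined.

      {code:scala}
      import scala.collection.concurrent.TrieMap
      import scala.util.{Failure, Success, Try}

      // Hypothetical, simplified model of the pattern described above.
      case class Job(id: Int, result: Try[Unit])

      class JobSet(val time: Long, jobs: Seq[Job]) {
        private var finished = 0
        def handleJobCompletion(job: Job): Unit = finished += 1
        def hasCompleted: Boolean = finished == jobs.size  // counts finished jobs, ignores their results
      }

      class Scheduler {
        val jobSets = TrieMap.empty[Long, JobSet]

        def handleJobCompletion(jobSet: JobSet, job: Job): Unit = {
          jobSet.handleJobCompletion(job)
          if (jobSet.hasCompleted) {
            // The batch is removed from the pending map here, BEFORE the
            // job's result is examined -- the ordering the report points at.
            jobSets.remove(jobSet.time)
          }
          // Only after the JobSet may already be gone is a failure reported.
          job.result match {
            case Failure(e) => println(s"job ${job.id} failed: ${e.getMessage}")
            case Success(_) => ()
          }
        }
      }
      {code}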

      Let's consider the following case:

      A micro-batch contains two jobs, each reading from a different Kafka topic. One of the jobs fails due to a problem in the user-defined logic, after the other one has finished successfully.

      1. The main thread in the Spark Streaming application executes the line mentioned above,

      2. another thread (the checkpoint writer) then writes a checkpoint file immediately after this line is executed, and

      3. due to the current error-handling mechanism in Spark Streaming, the StreamingContext is then closed (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).

      When the user recovers from the checkpoint file, the JobSet containing the failed job has already been removed (treated as completed) before the checkpoint was constructed, so the data that was being processed by the failed job will never be reprocessed (see the sketch below).
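      Reusing the hypothetical classes from the sketch above, the interleaving in steps 1-3 can be simulated like this; by the time a checkpoint would snapshot the pending JobSets, the batch containing the failed job is already gone:

      {code:scala}
      object LostBatchDemo extends App {
        val scheduler = new Scheduler

        val succeeded = Job(1, Success(()))  // the job reading topic A finishes fine
        val failed    = Job(2, Failure(new RuntimeException("user-defined logic failed")))

        val batch = new JobSet(1000L, Seq(succeeded, failed))
        scheduler.jobSets.put(batch.time, batch)

        scheduler.handleJobCompletion(batch, succeeded)
        scheduler.handleJobCompletion(batch, failed)  // JobSet removed despite the failure

        // A checkpoint written at this point would snapshot the pending
        // JobSets, which no longer contain the failed batch -- so recovery
        // from that checkpoint has nothing left to reprocess.
        println(s"pending batches: ${scheduler.jobSets.keys.toList}")  // prints: pending batches: List()
      }
      {code}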

      I might have missed something in the checkpoint thread or in handleJobCompletion(), or this is a potential bug.



          People

            Assignee: Nan Zhu (codingcat)
            Reporter: Nan Zhu (codingcat)
