SPARK-4545: If first Spark Streaming batch fails, it waits 10x batch duration before stopping


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.1.0, 1.2.1
    • Fix Version/s: None
    • Component/s: DStreams

    Description

      (I'd like to track the issue raised at http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdKY=QCT0YUdrkvbVuqXdFCGp1+6g-=s71Fk8ZR4uaTK7g@mail.gmail.com%3E as a JIRA, since I think it's a legitimate issue that I can look into, with some help.)

      This bit of {{JobGenerator.stop()}} executes, since the first log
      message below appears in the logs:

      {code}
      // Excerpt from JobGenerator.stop() (graceful shutdown path)
      def haveAllBatchesBeenProcessed = {
        lastProcessedBatch != null && lastProcessedBatch.milliseconds == stopTime
      }
      logInfo("Waiting for jobs to be processed and checkpoints to be written")
      while (!hasTimedOut && !haveAllBatchesBeenProcessed) {
        Thread.sleep(pollTime)
      }

      // ... the 10x batch duration wait happens here, before the next log line appears:

      logInfo("Waited for jobs to be processed and checkpoints to be written")
      {code}
      
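      To make the timing concrete, here is a standalone sketch of that loop's
      behavior (not Spark code; the 10x default comes from
      {{spark.streaming.gracefulStopTimeout}}, and the stand-in values are
      illustrative). With {{lastProcessedBatch}} never set, the loop can only
      exit via the timeout:

      {code}
      // Standalone sketch: models the wait loop above with a never-set
      // lastProcessedBatch. Assumes the graceful-stop timeout defaults to
      // 10x the batch duration (spark.streaming.gracefulStopTimeout).
      object WaitLoopSketch {
        def main(args: Array[String]): Unit = {
          val batchDurationMs = 1000L
          val stopTimeoutMs = 10 * batchDurationMs // assumed default timeout
          val pollTime = 100L
          val startTime = System.currentTimeMillis()
          val stopTime = startTime // stand-in for the stop timestamp

          // Never assigned, as when every batch fails.
          var lastProcessedBatch: java.lang.Long = null

          def hasTimedOut = System.currentTimeMillis() - startTime > stopTimeoutMs
          def haveAllBatchesBeenProcessed =
            lastProcessedBatch != null && lastProcessedBatch.longValue == stopTime

          while (!hasTimedOut && !haveAllBatchesBeenProcessed) {
            Thread.sleep(pollTime)
          }
          // Prints roughly 10000 ms: the full 10x-batch-duration wait.
          println(s"waited ${System.currentTimeMillis() - startTime} ms")
        }
      }
      {code}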

      I think {{lastProcessedBatch}} is always null here, since no batch ever
      succeeds. Of course, for all this code knows, the next batch might
      succeed, so it waits for one. But shouldn't it proceed once one more
      batch completes, even if that batch failed?
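
      For what it's worth, a minimal repro sketch along these lines (a
      hypothetical app; the {{queueStream}}/{{foreachRDD}} setup and the
      1-second batch duration are my choices for illustration) shows the
      graceful stop blocking for about 10 seconds:

      {code}
      // Hypothetical repro: every batch's job throws, so lastProcessedBatch
      // is never set and a graceful stop blocks for ~10x the batch duration.
      import scala.collection.mutable

      import org.apache.spark.SparkConf
      import org.apache.spark.rdd.RDD
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object FirstBatchFailureRepro {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setMaster("local[2]").setAppName("SPARK-4545")
          val ssc = new StreamingContext(conf, Seconds(1))

          // Queue enough RDDs that every batch has a (failing) job to run.
          val queue = mutable.Queue[RDD[Int]]()
          (1 to 100).foreach(_ => queue += ssc.sparkContext.makeRDD(1 to 10))

          ssc.queueStream(queue).foreachRDD { rdd =>
            rdd.foreach(_ => throw new RuntimeException("boom")) // every batch fails
          }

          ssc.start()
          Thread.sleep(3000) // let a few (failing) batches run

          val t0 = System.currentTimeMillis()
          ssc.stop(stopSparkContext = true, stopGracefully = true)
          // With a 1s batch duration, this prints roughly 10000 ms.
          println(s"stop() took ${System.currentTimeMillis() - t0} ms")
        }
      }
      {code}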

      {{JobGenerator.onBatchCompletion}} is only invoked for a successful
      batch. Could it be invoked when a batch fails too? I think that would
      fix it.
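
      A sketch of what that change might look like in
      {{JobScheduler.handleJobCompletion}} (written from memory against the
      1.x code, so the surrounding details are approximate; counting failed
      jobs toward batch completion is the proposal, not current behavior):

      {code}
      // Sketch only: mark the jobSet complete whether the job succeeded or
      // failed, so jobGenerator.onBatchCompletion(time) fires either way and
      // lastProcessedBatch can advance. Assumes scala.util.{Success, Failure}
      // are in scope, as in JobScheduler.
      private def handleJobCompletion(job: Job) {
        val jobSet = jobSets.get(job.time)
        jobSet.handleJobCompletion(job)
        logInfo("Finished job " + job + " from job set of time " + jobSet.time)
        if (jobSet.hasCompleted) {
          jobSets.remove(jobSet.time)
          // Now fires for failed batches too, unblocking JobGenerator.stop()
          jobGenerator.onBatchCompletion(jobSet.time)
        }
        job.result match {
          case Failure(e) => reportError("Error running job " + job, e)
          case _ => // success: nothing extra needed here
        }
      }
      {code}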

      Also, should the condition be {{lastProcessedBatch.milliseconds <=
      stopTime}} rather than {{==}}?


            People

              Unassigned Unassigned
              srowen Sean R. Owen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

Dates

    Created:
    Updated:
    Resolved: