
SPARK-19233: Inconsistent Behaviour of Spark Streaming Checkpoint


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0, 2.0.1, 2.0.2, 2.1.0
    • Fix Version/s: None
    • Component/s: DStreams

    Description

      When checking one of our application logs, we found the following (simplified) behaviour:

      1. The Spark application recovers from a checkpoint constructed at timestamp 1000 ms.

      2. The log shows that the Spark application recovers RDDs generated at timestamps 2000 and 3000 ms.

      The root cause is that the generateJobs event is pushed to the event queue by a separate thread (RecurringTimer), so by the time the doCheckpoint event for a batch reaches the queue, several later generateJobs events may already have been processed. As a result, when the doCheckpoint event for timestamp 1000 is handled, the generatedRDDs data structure already holds the RDDs generated at 2000 and 3000, and they are serialized as part of the checkpoint for 1000.
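
      As an illustration only (a toy sketch, not the actual JobGenerator code; the event classes and the string-valued map below are simplified stand-ins), the ordering looks roughly like this:

      import scala.collection.mutable

      sealed trait Event
      case class GenerateJobs(time: Long) extends Event
      case class DoCheckpoint(time: Long) extends Event

      object CheckpointRaceDemo extends App {
        // Simplified stand-in for the per-DStream generatedRDDs map (batch time -> RDD).
        val generatedRDDs = mutable.HashMap.empty[Long, String]

        // The timer thread keeps posting GenerateJobs events; the DoCheckpoint event
        // for batch 1000 is only processed after later batches have been generated.
        val events = Seq(GenerateJobs(1000), GenerateJobs(2000), GenerateJobs(3000), DoCheckpoint(1000))

        events.foreach {
          case GenerateJobs(t) => generatedRDDs(t) = s"RDD@$t"
          case DoCheckpoint(t) =>
            // The whole map is serialized, so the checkpoint for 1000 ms also carries
            // the RDDs generated at 2000 ms and 3000 ms.
            println(s"checkpoint at $t ms contains batches ${generatedRDDs.keys.toSeq.sorted}")
        }
      }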

      This adds overhead when debugging and when coordinating our offset management strategy with Spark Streaming's checkpoint strategy, as we are developing a new type of DStream that integrates Spark Streaming with an internal message middleware.

      The proposed fix is to filter generatedRDDs by the checkpoint's timestamp when serializing it, so that only RDDs generated at or before that timestamp are included.
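
      A minimal sketch of that filtering, assuming generatedRDDs is the per-DStream HashMap keyed by batch Time (the helper name rddsToCheckpoint is hypothetical, not an existing Spark method):

      import scala.collection.mutable

      import org.apache.spark.rdd.RDD
      import org.apache.spark.streaming.Time

      // Hypothetical helper: select only the entries that belong in a checkpoint
      // taken for `checkpointTime`, so RDDs from later batches are not serialized.
      def rddsToCheckpoint(
          generatedRDDs: mutable.HashMap[Time, RDD[_]],
          checkpointTime: Time): mutable.HashMap[Time, RDD[_]] = {
        generatedRDDs.filter { case (batchTime, _) => batchTime <= checkpointTime }
      }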

          People

            Assignee: Unassigned
            Reporter: Nan Zhu (codingcat)
            Votes: 0
            Watchers: 5
