When checking one of our application logs, we found the following behavior (simplified):
1. The Spark application recovers from a checkpoint constructed at timestamp 1000ms
2. The log shows that the application recovers RDDs generated at timestamps 2000ms and 3000ms
The root cause is that the generateJobs event is pushed to the event queue by a separate thread (RecurringTimer). Before the doCheckpoint event for timestamp 1000 is pushed to the queue, multiple later generateJobs events may already have been processed. As a result, when doCheckpoint for timestamp 1000 is processed, the generatedRDDs data structure already contains the RDDs generated at 2000 and 3000, and they are serialized as part of the checkpoint for 1000.
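The race can be illustrated with a minimal, self-contained sketch. The names (generated_rdds, event tuples) are hypothetical stand-ins written in plain Python; in Spark the corresponding state lives in DStream.generatedRDDs and the events are processed by the JobGenerator's event loop.

```python
from collections import deque

# Hypothetical stand-in for DStream.generatedRDDs: time -> RDD placeholder
generated_rdds = {}
checkpointed = {}  # checkpoint time -> snapshot that actually got serialized

# Events in the order the single event-processing thread sees them:
# the timer thread has already pushed generateJobs for 2000 and 3000
# before doCheckpoint(1000) is processed.
queue = deque([
    ("generateJobs", 1000),
    ("generateJobs", 2000),
    ("generateJobs", 3000),
    ("doCheckpoint", 1000),
])

while queue:
    event, t = queue.popleft()
    if event == "generateJobs":
        generated_rdds[t] = f"rdd@{t}"
    else:
        # doCheckpoint serializes the *current* map, not the state at time t
        checkpointed[t] = dict(generated_rdds)

# The checkpoint for 1000 wrongly contains RDDs generated at 2000 and 3000.
print(sorted(checkpointed[1000]))  # -> [1000, 2000, 3000]
```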
This adds overhead both for debugging and for coordinating our offset-management strategy with Spark Streaming's checkpoint strategy, as we are developing a new type of DStream that integrates Spark Streaming with an internal message middleware.
The proposed fix is to filter generatedRDDs by the checkpoint timestamp when serializing it.
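A minimal sketch of the proposed filtering, again in plain Python with hypothetical names (the real change would go where generatedRDDs is serialized during checkpointing): only entries generated at or before the checkpoint timestamp are kept in the serialized snapshot.

```python
# Hypothetical generatedRDDs state at the moment doCheckpoint(1000) runs
generated_rdds = {1000: "rdd@1000", 2000: "rdd@2000", 3000: "rdd@3000"}

def snapshot_for_checkpoint(rdds, checkpoint_time):
    """Keep only RDDs generated at or before the checkpoint timestamp."""
    return {t: r for t, r in rdds.items() if t <= checkpoint_time}

# The checkpoint for 1000 now contains only the RDD for 1000.
print(sorted(snapshot_for_checkpoint(generated_rdds, 1000)))  # -> [1000]
```

With this filter in place, recovering from the checkpoint at 1000 can no longer surface RDDs from 2000 or 3000.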