When checking one of our application logs, we found the following behavior (simplified):
1. The Spark application recovers from a checkpoint constructed at timestamp 1000ms
2. The log shows that the application recovers RDDs generated at timestamps 2000ms and 3000ms
The root cause is that the generateJobs event is pushed to the event queue by a separate thread (RecurringTimer). Before the doCheckpoint event for timestamp 1000 is pushed to the queue, multiple later generateJobs events may already have been processed. As a result, when doCheckpoint for timestamp 1000 is processed, the generatedRDDs data structure already contains the RDDs generated at 2000 and 3000, and they are serialized as part of the checkpoint for 1000.
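The race can be illustrated with a minimal, self-contained sketch. The names (generated_rdds, event tuples) are hypothetical stand-ins written in plain Python; in Spark the corresponding state lives in DStream.generatedRDDs and the events are processed by the JobGenerator's event loop.

```python
from collections import deque

# Hypothetical stand-in for DStream.generatedRDDs: time -> RDD placeholder
generated_rdds = {}
checkpointed = {}  # checkpoint time -> snapshot that actually got serialized

# Events in the order the single event-processing thread sees them:
# the timer thread has already pushed generateJobs for 2000 and 3000
# before doCheckpoint(1000) is processed.
queue = deque([
    ("generateJobs", 1000),
    ("generateJobs", 2000),
    ("generateJobs", 3000),
    ("doCheckpoint", 1000),
])

while queue:
    event, t = queue.popleft()
    if event == "generateJobs":
        generated_rdds[t] = f"rdd@{t}"
    else:
        # doCheckpoint serializes the *current* map, not the state at time t
        checkpointed[t] = dict(generated_rdds)

# The checkpoint for 1000 wrongly contains RDDs generated at 2000 and 3000.
print(sorted(checkpointed[1000]))  # -> [1000, 2000, 3000]
```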
This adds overhead both for debugging and for coordinating our offset-management strategy with Spark Streaming's checkpoint strategy, as we are developing a new type of DStream that integrates Spark Streaming with an internal message middleware.
The proposed fix is to filter generatedRDDs by the checkpoint timestamp when serializing it.
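A minimal sketch of the proposed filtering, again in plain Python with hypothetical names (the real change would go where generatedRDDs is serialized during checkpointing): only entries generated at or before the checkpoint timestamp are kept in the serialized snapshot.

```python
# Hypothetical generatedRDDs state at the moment doCheckpoint(1000) runs
generated_rdds = {1000: "rdd@1000", 2000: "rdd@2000", 3000: "rdd@3000"}

def snapshot_for_checkpoint(rdds, checkpoint_time):
    """Keep only RDDs generated at or before the checkpoint timestamp."""
    return {t: r for t, r in rdds.items() if t <= checkpoint_time}

# The checkpoint for 1000 now contains only the RDD for 1000.
print(sorted(snapshot_for_checkpoint(generated_rdds, 1000)))  # -> [1000]
```

With this filter in place, recovering from the checkpoint at 1000 can no longer surface RDDs from 2000 or 3000.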