Details
-
Bug
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
3.0.0
-
None
-
None
-
Fix batch submission delay caused by RDD actions in DStream.transfrom
Description
Issue
Some spark applications have batch creation delay after running for some time. For instance, Batch 10:03 is submitted at 10:06. In spark UI, the latest batch doesn't match current time.
Clock | BatchTime |
---|---|
10:00 | 10:00 |
10:02 | 10:01 |
10:04 | 10:02 |
10:06 | 10:03 |
Investigation
We observe such applications share a commonality that rdd actions exist in dstream.transfrom. Those actions will be executed in dstream.compute, which is called by JobGenerator. JobGenerator runs with a single thread event loop so any synchronized operations will block event processing.
Proposal
delegate dstream.compute to JobSchduler
// class ForEachDStream override def generateJob(time: Time): Option[Job] = { val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) { parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time)) } Some(new Job(time, jobFunc)) }
Attachments
Issue Links
- links to