Description
There are two problems in the current mechanism:
- If a task fails more than "spark.task.maxFailures" (default: 4) times, the whole job fails. For a long-running Streaming application, a Receiver task can easily fail more than 4 times because of executor loss, which then kills the application.
- When an executor is lost, the Receiver tasks on it are rescheduled. However, because many Spark jobs may be running at the same time, TaskScheduler cannot always schedule them so that Receivers are distributed evenly across the cluster.
To address these limitations, we need to change the receiver scheduling mechanism; the sketch below illustrates how a typical multi-receiver application runs into both problems. Here is the design doc: https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing
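For concreteness, here is a minimal sketch (not from the design doc) of a multi-receiver streaming application; the app name, ports, and the raised "spark.task.maxFailures" value are illustrative assumptions, and raising that limit is only a stopgap for the first problem:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MultiReceiverApp") // hypothetical app; master is set via spark-submit
      // Stopgap only: a higher per-task failure limit delays, but does not
      // prevent, job failure when a Receiver task keeps dying on executor loss.
      .set("spark.task.maxFailures", "100")

    val ssc = new StreamingContext(conf, Seconds(1))

    // Four socket receivers, each running as a long-lived task. Under the
    // current mechanism, TaskScheduler may pack these tasks onto a few
    // executors instead of spreading them evenly across the cluster.
    val streams = (1 to 4).map(i => ssc.socketTextStream("localhost", 9000 + i))
    ssc.union(streams).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that nothing in this code can influence where TaskScheduler places the four Receiver tasks after an executor is lost, which is exactly the second limitation above.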
Issue Links
- supercedes
  - SPARK-3283 Receivers sometimes do not get spread out to multiple nodes (Resolved)
  - SPARK-7942 Receiver's life cycle is inconsistent with streaming job. (Resolved)
- links to