The current receiver scheduling mechanism has some problems:
- If a task fails more than “spark.task.maxFailures” (default: 4) times, the whole job fails. For a long-running Streaming application, it’s quite possible that a Receiver task fails more than 4 times because of executor loss.
- When an executor is lost, the Receiver tasks running on it will be rescheduled. However, because many Spark jobs may be running at the same time, the TaskScheduler may not be able to schedule them in a way that keeps Receivers evenly distributed across executors.
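As context for the first point: `spark.task.maxFailures` is a standard Spark configuration and can be raised for a Streaming application as a partial workaround, though it does not address the uneven-distribution problem. A minimal sketch (the application JAR name and master URL are placeholders):

```shell
# Raise the per-task failure tolerance so a Receiver task is not
# given up on after 4 executor-loss-induced failures.
# (Placeholder master URL and JAR; adjust for your deployment.)
spark-submit \
  --master spark://master:7077 \
  --conf spark.task.maxFailures=16 \
  my-streaming-app.jar
```

The same property can also be set programmatically on the `SparkConf` before the `StreamingContext` is created; either way it only delays job failure rather than fixing the scheduling itself.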
To address these limitations, we need to change the receiver scheduling mechanism. Here is the design doc: https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing