Description
The job was submitted with a total of 4 retries. In each retry, most of the evaluators finished data downloading/deserialization within 6-30 minutes, but about 3 evaluators were very slow. The slowest one took about 2-8 hours to download/deserialize data in each retry. The retry was triggered after a 30-minute timeout (configurable). The driver cannot send a close event to those slower evaluators until they complete data loading and send the IRunningTask event to the driver. After running for a long time, the job was killed.
A simple band-aid is to kill the evaluators from which we have not received a RunningTask event by the 30-minute timeout, and to cancel the RunningTasks that have already been received. It is needless to wait 8 hours only to cancel RunningTasks that have just finished downloading/deserializing the data.
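The proposed band-aid could look roughly like the sketch below. This is only an illustration of the decision logic, not driver code: the `Evaluator` class, its fields, and `on_retry_timeout` are hypothetical names, and the real kill/cancel calls are replaced by a state change.

```python
from dataclasses import dataclass


@dataclass
class Evaluator:
    """Hypothetical stand-in for an evaluator tracked by the driver."""
    evaluator_id: str
    running_task_received: bool  # True once the driver got the RunningTask event
    state: str = "active"        # active | task_cancelled | killed


def on_retry_timeout(evaluators):
    """Called when the (configurable) retry timeout fires.

    Evaluators that already reported a RunningTask get their task cancelled;
    evaluators that are still loading data are killed right away instead of
    being waited on.
    """
    cancelled, killed = [], []
    for ev in evaluators:
        if ev.running_task_received:
            ev.state = "task_cancelled"  # placeholder for closing the running task
            cancelled.append(ev.evaluator_id)
        else:
            ev.state = "killed"          # placeholder for killing the evaluator
            killed.append(ev.evaluator_id)
    return cancelled, killed


if __name__ == "__main__":
    evaluators = [
        Evaluator("fast-1", running_task_received=True),
        Evaluator("slow-1", running_task_received=False),
    ]
    print(on_retry_timeout(evaluators))
```

With this approach the retry no longer depends on the slowest evaluator finishing its download before it can be shut down.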