Affects Version/s: 3.1.2
Fix Version/s: None
A map or reduce attempt remains in the NEW state, and the job gets stuck under certain conditions.
The situation is as follows:
- The total task count (map or reduce) equals the running task limit (mapreduce.job.running.map.limit / mapreduce.job.running.reduce.limit).
- The job starts and all tasks (map or reduce) are running. Then an attempt fails for some reason.
- A new container is requested because the attempt failed.
- The new container is allocated quickly.
- However, the new container is released because the failed attempt has not been cleaned up yet (allocated == total == running limit).
- The failed attempt is subsequently cleaned up, but the new attempt stays in the NEW state and waits forever.
- The job is stuck.
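To make the sequence above easier to follow, here is a minimal toy model of the accounting as I understand it. It is not Hadoop code: the class, fields, and methods are all hypothetical, and the key assumption (nothing re-issues the container request after the failed attempt is cleaned up) is my reading of the symptom, not something confirmed from the allocator source.

```java
// Toy model of the sequence described above. NOT Hadoop code:
// every name here is hypothetical and only illustrates the suspected race.
class RunningLimitModel {
    private final int runningLimit; // stands in for mapreduce.job.running.map.limit
    private int assigned;           // attempts still counted against the limit
    private int pendingRequests;    // container requests not yet answered
    private int waitingAttempts;    // new attempts in NEW state without a container

    RunningLimitModel(int totalTasks, int runningLimit) {
        this.runningLimit = runningLimit;
        // Job start: every task gets a container, up to the limit.
        this.assigned = Math.min(totalTasks, runningLimit);
    }

    // An attempt fails. A replacement attempt is created and a container is
    // requested right away, but the failed attempt still counts as assigned.
    void attemptFailed() {
        waitingAttempts++;
        pendingRequests++;
    }

    // A container arrives. Returns true if it is kept, false if it is released.
    boolean containerAllocated() {
        pendingRequests--;
        if (assigned >= runningLimit) {
            return false; // released: the limit still looks fully used
        }
        assigned++;
        waitingAttempts--;
        return true;
    }

    // Cleanup of the failed attempt finally frees its slot.
    // Assumption: nothing re-issues the container request at this point.
    void failedAttemptCleanedUp() {
        assigned--;
    }

    // Stuck: an attempt needs a container but no request is outstanding.
    boolean isStuck() {
        return waitingAttempts > 0 && pendingRequests == 0;
    }

    public static void main(String[] args) {
        // total == running limit, and the container arrives before cleanup.
        RunningLimitModel job = new RunningLimitModel(3, 3);
        job.attemptFailed();
        boolean kept = job.containerAllocated();
        job.failedAttemptCleanedUp();
        System.out.println("container kept: " + kept + ", stuck: " + job.isStuck());
    }
}
```

In this model the job only gets stuck when the container arrives before the cleanup and the total equals the limit. If the cleanup runs first, or the limit is larger than the total task count, the container is kept and the attempt proceeds.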
We switched the MR framework to 2.7.1 and confirmed that the job works well there.
Perhaps it is related to
Can you help me?