Description
In a gridmix run with ~1000 jobs, one job gets stuck because 2-3 of its reducers hang. All of them are stuck after downloading all map outputs, and all show the following thread dump:
"EventFetcher for fetching Map Completion Events" daemon prio=10 tid=0xa325fc00 nid=0x1ca4 waiting on condition [0xa315c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapreduce.task.reduce.EventFetcher.run(EventFetcher.java:71)

"main" prio=10 tid=0x080ed400 nid=0x1c71 in Object.wait() [0xf73a2000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xa94b23d8> (a org.apache.hadoop.mapreduce.task.reduce.EventFetcher)
        at java.lang.Thread.join(Thread.java:1143)
        - locked <0xa94b23d8> (a org.apache.hadoop.mapreduce.task.reduce.EventFetcher)
        at java.lang.Thread.join(Thread.java:1196)
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:135)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:367)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:147)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:142)
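The dump shows the main thread blocked in Thread.join() on the EventFetcher while that thread is still sleeping in its polling loop. This report does not state the root cause. Purely as an illustration, the sketch below shows one way such a state can arise in general: a polling thread that swallows InterruptedException and checks no other stop signal never exits, so an untimed join() on it would wait forever. PollingFetcher, shutDown, and stopsWithin are hypothetical names invented for this sketch. They are not Hadoop code and not necessarily what EventFetcher does.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class FetcherShutdownSketch {

    /** Hypothetical stand-in for a polling thread; not the real EventFetcher. */
    static class PollingFetcher extends Thread {
        private final AtomicBoolean stopped = new AtomicBoolean(false);
        private final boolean honorStopFlag;

        PollingFetcher(boolean honorStopFlag) {
            this.honorStopFlag = honorStopFlag;
            setName("PollingFetcher");
            setDaemon(true); // lets the JVM exit even if this thread never stops
        }

        @Override
        public void run() {
            // With honorStopFlag == false the condition is always true,
            // so the only way out would be an uncaught exception.
            while (!(honorStopFlag && stopped.get())) {
                try {
                    Thread.sleep(50); // poll interval (arbitrary for the sketch)
                } catch (InterruptedException e) {
                    // Interrupt is swallowed: the loop continues unless
                    // the stop flag is consulted in the loop condition.
                }
            }
        }

        void shutDown() {
            stopped.set(true);
            interrupt(); // wake the thread out of sleep()
        }
    }

    /** Returns true if the fetcher terminated within timeoutMs of shutDown(). */
    static boolean stopsWithin(boolean honorStopFlag, long timeoutMs) {
        PollingFetcher fetcher = new PollingFetcher(honorStopFlag);
        fetcher.start();
        try {
            Thread.sleep(100);       // let it enter the polling loop
            fetcher.shutDown();
            fetcher.join(timeoutMs); // timed join so the sketch itself cannot hang
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("caller interrupted", e);
        }
        return !fetcher.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("interrupt swallowed, no stop flag: stopped=" + stopsWithin(false, 500));
        System.out.println("stop flag checked in loop:         stopped=" + stopsWithin(true, 500));
    }
}
```

In the first case the thread is still alive after the timed join, which matches the shape of the dump above (joiner waiting, target sleeping). In the second case the explicit stop flag lets the loop exit even though the interrupt itself was swallowed.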
Thanks to karams for helping track this down.