[MAPREDUCE-5501] RMContainer Allocator does not stop when cluster shutdown is performed in tests - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: resourcemanager
Labels: None

Description

After running the MR job client tests, many MRAppMaster processes stay alive. The reason seems to be that the RMContainerAllocator thread (named "RMCommunicator Allocator" in the logs) ignores InterruptedException and keeps retrying:

2013-09-09 18:52:07,505 WARN [RMCommunicator Allocator] org.apache.hadoop.util.ThreadUtil: interrupted while sleeping
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:149)
        at com.sun.proxy.$Proxy29.allocate(Unknown Source)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:154)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:553)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:236)
        at java.lang.Thread.run(Thread.java:680)
2013-09-09 18:52:37,639 INFO [RMCommunicator Allocator] org.apache.hadoop.ipc.Client: Retrying connect to server: dhcpx-197-141.corp.yahoo.com/10.73.197.141:61163. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-09-09 18:52:38,640 INFO [RMCommunicator Allocator] org.apache.hadoop.ipc.Client: Retrying connect to server: dhcpx-197-141.corp.yahoo.com/10.73.197.141:61163. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

The processes take more than 6 minutes to die (the log below reports a 360000 ms timeout), which causes various issues for tests that use the same DFS directory.

2013-09-09 22:26:47,179 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Could not contact RM after 360000 milliseconds.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not contact RM after 360000 milliseconds.
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:563)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:236)
        at java.lang.Thread.run(Thread.java:680)

Will attach a thread dump separately.
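
For illustration, here is a minimal sketch of a retry loop that stops as soon as its thread is interrupted, instead of sleeping through the interrupt the way the sleepAtLeastIgnoreInterrupts frame in the trace above appears to. The class InterruptAwareRetry and the method callWithRetries are hypothetical names invented for this sketch; this is not Hadoop's RetryInvocationHandler and not a proposed patch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class InterruptAwareRetry {

    /**
     * Hypothetical helper (not a Hadoop API): calls the action until it
     * succeeds, the retries run out, or the calling thread is interrupted.
     */
    static <T> T callWithRetries(Callable<T> action, int maxRetries, long sleepMillis)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Never retry after an interrupt: let the caller shut down.
                throw e;
            } catch (Exception e) {
                last = e;
            }
            if (attempt < maxRetries) {
                // Thread.sleep throws InterruptedException when the thread is
                // interrupted, so the loop ends instead of continuing to retry.
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(() -> {
            try {
                // Simulates a server that is never reachable.
                callWithRetries(() -> {
                    throw new IOException("server unreachable");
                }, 1000, 1000);
            } catch (InterruptedException e) {
                System.out.println("worker stopped after interrupt");
            } catch (Exception e) {
                System.out.println("retries exhausted: " + e.getMessage());
            }
        }, "allocator-sketch");

        worker.start();
        Thread.sleep(200);   // let the worker enter its retry sleep
        worker.interrupt();  // stands in for the shutdown request
        worker.join(5000);
        System.out.println("worker alive: " + worker.isAlive());
    }
}
```

With this shape, an interrupt delivered during shutdown ends the retry loop within one sleep interval rather than after the full retry budget. Whether propagating the interrupt is appropriate for the real allocator thread depends on its shutdown protocol, which this sketch does not model.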

Attachments

hanging-rmcontainer-allocator.stdout (37 kB, 10/Sep/13 22:44, Andrey Klochkov)
hanging-rmcontainer-allocator.syslog (63 kB, 10/Sep/13 22:45, Andrey Klochkov)

Issue Links

relates to

YARN-1183 MiniYARNCluster shutdown takes several minutes intermittently (Closed)

People

Assignee: Andrey Klochkov
Reporter: Andrey Klochkov
Votes: 0
Watchers: 3

Dates

Created: 10/Sep/13 05:58
Updated: 10/Mar/15 04:30
Resolved: 11/Sep/13 23:02