Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
Reviewed
Description
In ContainerImpl.getLocalizedResources(), there's:
assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), which is scheduled on a separate thread. If the container is not at LOCALIZED (e.g. it is at KILLING, see YARN-906), an AssertError will be thrown and fails the thread without notifying NM. Therefore, the container cannot receive more events, which are supposed to be sent from ContainerLaunch.call(), and move towards completion.
Attachments
Attachments
- YARN-966.1.patch
- 14 kB
- Zhijie Shen
Issue Links
- relates to
-
YARN-906 Cancelling ContainerLaunch#call at KILLING causes that the container cannot be completed
- Closed
Activity
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12593887/YARN-966.1.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1564//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1564//console
This message is automatically generated.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12593887/YARN-966.1.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1572//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1572//console
This message is automatically generated.
few questions..
- assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!! - return localizedResources; + if (ContainerState.LOCALIZED == getContainerState()) { + return localizedResources; + } else { + return null; + }
after this change are we saying that container's partial localized resources will not be accessible during the state other than LOCALIZED and we will return null??
SUCCESS: Integrated in Hadoop-trunk-Commit #4190 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4190/)
YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)
- /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
- Also I think we are mixing
YARN-966withYARN-906. I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts?private final Map<Path,List<String>> localizedResources = new HashMap<Path,List<String>>();
- I guess we should not fix that TODO. that has nothing to do with getLocalizedResources() call. Probably caller can make getState() call and figure out in what state container is....thoughts?
Committed this to trunk, branch-2 and branch-2.1. Thanks Zhijie!
after this change are we saying that container's partial localized resources will not be accessible during the state other than LOCALIZED and we will return null??
There is only one user of the API today. We can change it in future.
Also I think we are mixing
YARN-966withYARN-906. I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts?
We aren't mixing YARN-966. We are just fixing an assert to be explicitly checked so that IN case something happens, we can report properly. In all my understanding, this code path should never get triggered.
Also I think we are mixing
YARN-966withYARN-906. I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts?
One more consideration. Empty map can means the case that the container is at LOCALIZED, but actually there's no localized resources. Returning null is to distinguish this case with the case of fetch the localized resources when the container is not at LOCALIZED.
SUCCESS: Integrated in Hadoop-Yarn-trunk #287 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/287/)
YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)
- /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
FAILURE: Integrated in Hadoop-Hdfs-trunk #1477 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1477/)
YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)
- /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1504 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1504/)
YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)
- /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
- /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
One more consideration. Empty map can means the case that the container is at LOCALIZED, but actually there's no localized resources. Returning null is to distinguish this case with the case of fetch the localized resources when the container is not at LOCALIZED.
This assumption is wrong. State of container has nothing to do with localized resources map. we can call getState and know its state irrespective of this null check. Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded. Its different issue that user may kill the container resulting into a state transition. This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously.
So if user say kills the container what error we will see is
+ RPCUtil.getRemoteException( + "Unable to get local resources when Container " + containerID + + " is at " + container.getContainerState());
which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources.
Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded.
This is the most important point.
This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously.
which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources.
I think we are beating this down to death. Like I said, this error SHOULD NOT happen in practice. I don't know whey the assert was originally put in place. That said, I didn't want to blindly remove it without knowing why it was there to begin with. If ever we run into this in real life, we can fix the message.
Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded.
YARN-906 is such a corner case.
This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously.
The original code checks state == LOCALIZED, and throws AssertError when getting the localized resources. I just modified the way to indicate the error, such that the callers of it can more easily handle the error. If you think calling getLocalizedResources() when the container is not at LOCALIZED is not wrong, I'm afraid we're in the different conversation.
which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources.
I don't think the message is misleading. Again, getLocalizedResources() is not allowed to be called when the container is not at LOCALIZED (at least the original code means it). So the message clearly states problem. Please note that killing signal is not the root problem of the thread failure here. If getLocalizedResources() were not called, the thread would still complete without exception.
Created a patch:
1. ContainerImpl#getLocalizedResources returns null when it's not at LOCALIZED
2. ContainerLaunch#call checks null after calling ContainerImpl#getLocalizedResources. If null, it exits the thread, and sends CONTAINER_EXITED_WITH_FAILURE.
3. Corresponding test cases are added.