Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-966

The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 2.1.1-beta
    • None
    • None
    • Reviewed

    Description

      In ContainerImpl.getLocalizedResources(), there's:

      assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
      

      ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), which is scheduled on a separate thread. If the container is not at LOCALIZED (e.g. it is at KILLING, see YARN-906), an AssertError will be thrown and fails the thread without notifying NM. Therefore, the container cannot receive more events, which are supposed to be sent from ContainerLaunch.call(), and move towards completion.

      Attachments

        1. YARN-966.1.patch
          14 kB
          Zhijie Shen

        Issue Links

          Activity

            zjshen Zhijie Shen added a comment -

            Created a patch:

            1. ContainerImpl#getLocalizedResources returns null when it's not at LOCALIZED

            2. ContainerLaunch#call checks null after calling ContainerImpl#getLocalizedResources. If null, it exits the thread, and sends CONTAINER_EXITED_WITH_FAILURE.

            3. Corresponding test cases are added.

            zjshen Zhijie Shen added a comment - Created a patch: 1. ContainerImpl#getLocalizedResources returns null when it's not at LOCALIZED 2. ContainerLaunch#call checks null after calling ContainerImpl#getLocalizedResources. If null, it exits the thread, and sends CONTAINER_EXITED_WITH_FAILURE. 3. Corresponding test cases are added.
            hadoopqa Hadoop QA added a comment -

            +1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12593887/YARN-966.1.patch
            against trunk revision .

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 2 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1564//testReport/
            Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1564//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12593887/YARN-966.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1564//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1564//console This message is automatically generated.
            hadoopqa Hadoop QA added a comment -

            +1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12593887/YARN-966.1.patch
            against trunk revision .

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 2 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1572//testReport/
            Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1572//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12593887/YARN-966.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1572//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1572//console This message is automatically generated.

            Looks straight-forward. +1. Checking this in.

            vinodkv Vinod Kumar Vavilapalli added a comment - Looks straight-forward. +1. Checking this in.

            few questions..

            -    assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
            -    return localizedResources;
            +      if (ContainerState.LOCALIZED == getContainerState()) {
            +        return localizedResources;
            +      } else {
            +        return null;
            +      }
            

            after this change are we saying that container's partial localized resources will not be accessible during the state other than LOCALIZED and we will return null??

            ojoshi Omkar Vinit Joshi added a comment - few questions.. - assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!! - return localizedResources; + if (ContainerState.LOCALIZED == getContainerState()) { + return localizedResources; + } else { + return null ; + } after this change are we saying that container's partial localized resources will not be accessible during the state other than LOCALIZED and we will return null??
            hudson Hudson added a comment -

            SUCCESS: Integrated in Hadoop-trunk-Commit #4190 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4190/)
            YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)

            • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
            hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #4190 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4190/ ) YARN-966 . Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
            • Also I think we are mixing YARN-966 with YARN-906. I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts?
                private final Map<Path,List<String>> localizedResources =
                  new HashMap<Path,List<String>>();
              
            • I guess we should not fix that TODO. that has nothing to do with getLocalizedResources() call. Probably caller can make getState() call and figure out in what state container is....thoughts?
            ojoshi Omkar Vinit Joshi added a comment - Also I think we are mixing YARN-966 with YARN-906 . I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts? private final Map<Path,List< String >> localizedResources = new HashMap<Path,List< String >>(); I guess we should not fix that TODO. that has nothing to do with getLocalizedResources() call. Probably caller can make getState() call and figure out in what state container is....thoughts?

            Committed this to trunk, branch-2 and branch-2.1. Thanks Zhijie!

            vinodkv Vinod Kumar Vavilapalli added a comment - Committed this to trunk, branch-2 and branch-2.1. Thanks Zhijie!

            after this change are we saying that container's partial localized resources will not be accessible during the state other than LOCALIZED and we will return null??

            There is only one user of the API today. We can change it in future.

            Also I think we are mixing YARN-966 with YARN-906. I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts?

            We aren't mixing YARN-966. We are just fixing an assert to be explicitly checked so that IN case something happens, we can report properly. In all my understanding, this code path should never get triggered.

            vinodkv Vinod Kumar Vavilapalli added a comment - after this change are we saying that container's partial localized resources will not be accessible during the state other than LOCALIZED and we will return null?? There is only one user of the API today. We can change it in future. Also I think we are mixing YARN-966 with YARN-906 . I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts? We aren't mixing YARN-966 . We are just fixing an assert to be explicitly checked so that IN case something happens, we can report properly. In all my understanding, this code path should never get triggered.
            zjshen Zhijie Shen added a comment -

            Also I think we are mixing YARN-966 with YARN-906. I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts?

            One more consideration. Empty map can means the case that the container is at LOCALIZED, but actually there's no localized resources. Returning null is to distinguish this case with the case of fetch the localized resources when the container is not at LOCALIZED.

            zjshen Zhijie Shen added a comment - Also I think we are mixing YARN-966 with YARN-906 . I don't see any point why we should return null...we can return empty map if no resources are localized.. thoughts? One more consideration. Empty map can means the case that the container is at LOCALIZED, but actually there's no localized resources. Returning null is to distinguish this case with the case of fetch the localized resources when the container is not at LOCALIZED.
            hudson Hudson added a comment -

            SUCCESS: Integrated in Hadoop-Yarn-trunk #287 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/287/)
            YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)

            • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
            hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #287 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/287/ ) YARN-966 . Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
            hudson Hudson added a comment -

            FAILURE: Integrated in Hadoop-Hdfs-trunk #1477 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1477/)
            YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)

            • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
            hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #1477 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1477/ ) YARN-966 . Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
            hudson Hudson added a comment -

            SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1504 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1504/)
            YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)

            • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
            • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
            hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1504 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1504/ ) YARN-966 . Fixed ContainerLaunch to not fail quietly when there are no localized resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java

            One more consideration. Empty map can means the case that the container is at LOCALIZED, but actually there's no localized resources. Returning null is to distinguish this case with the case of fetch the localized resources when the container is not at LOCALIZED.

            This assumption is wrong. State of container has nothing to do with localized resources map. we can call getState and know its state irrespective of this null check. Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded. Its different issue that user may kill the container resulting into a state transition. This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously.

            ojoshi Omkar Vinit Joshi added a comment - One more consideration. Empty map can means the case that the container is at LOCALIZED, but actually there's no localized resources. Returning null is to distinguish this case with the case of fetch the localized resources when the container is not at LOCALIZED. This assumption is wrong. State of container has nothing to do with localized resources map. we can call getState and know its state irrespective of this null check. Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded. Its different issue that user may kill the container resulting into a state transition. This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously.

            So if user say kills the container what error we will see is

            +        RPCUtil.getRemoteException(
            +            "Unable to get local resources when Container " + containerID +
            +            " is at " + container.getContainerState());
            

            which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources.

            ojoshi Omkar Vinit Joshi added a comment - So if user say kills the container what error we will see is + RPCUtil.getRemoteException( + "Unable to get local resources when Container " + containerID + + " is at " + container.getContainerState()); which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources.

            Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded.

            This is the most important point.

            This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously.

            which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources.

            I think we are beating this down to death. Like I said, this error SHOULD NOT happen in practice. I don't know whey the assert was originally put in place. That said, I didn't want to blindly remove it without knowing why it was there to begin with. If ever we run into this in real life, we can fix the message.

            vinodkv Vinod Kumar Vavilapalli added a comment - Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded. This is the most important point. This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously. which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources. I think we are beating this down to death. Like I said, this error SHOULD NOT happen in practice. I don't know whey the assert was originally put in place. That said, I didn't want to blindly remove it without knowing why it was there to begin with. If ever we run into this in real life, we can fix the message.
            zjshen Zhijie Shen added a comment -

            Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded.

            YARN-906 is such a corner case.

            This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously.

            The original code checks state == LOCALIZED, and throws AssertError when getting the localized resources. I just modified the way to indicate the error, such that the callers of it can more easily handle the error. If you think calling getLocalizedResources() when the container is not at LOCALIZED is not wrong, I'm afraid we're in the different conversation.

            which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources.

            I don't think the message is misleading. Again, getLocalizedResources() is not allowed to be called when the container is not at LOCALIZED (at least the original code means it). So the message clearly states problem. Please note that killing signal is not the root problem of the thread failure here. If getLocalizedResources() were not called, the thread would still complete without exception.

            zjshen Zhijie Shen added a comment - Potentially I don't see when we will in fact start ContainerLaunch#call without its all resources getting downloaded. YARN-906 is such a corner case. This I still see should not be done via NULL check. Proper way is to set boolean flag of ContainerLaunch in the event of KILL synchronously. The original code checks state == LOCALIZED, and throws AssertError when getting the localized resources. I just modified the way to indicate the error, such that the callers of it can more easily handle the error. If you think calling getLocalizedResources() when the container is not at LOCALIZED is not wrong, I'm afraid we're in the different conversation. which is completely misleading.. Indeed this occurred because user killed container not because it failed to localize resources. I don't think the message is misleading. Again, getLocalizedResources() is not allowed to be called when the container is not at LOCALIZED (at least the original code means it). So the message clearly states problem. Please note that killing signal is not the root problem of the thread failure here. If getLocalizedResources() were not called, the thread would still complete without exception.

            People

              zjshen Zhijie Shen
              zjshen Zhijie Shen
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: