Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-4744

Too many signal to container failure in case of LCE

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Install HA cluster in secure mode
      Enable LCE with cgroups
      Start server with dsperf user
      Submit mapreduce application terasort/teragen with user yarn/dsperf
      Too many signal to container failure

      Submit with user the exception is thrown

      2014-03-02 09:20:38,689 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
      2014-03-02 09:20:40,158 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_e02_1393731146548_0001_01_000013
      2014-03-02 09:20:43,071 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_e02_1393731146548_0001_01_000009 succeeded
      2014-03-02 09:20:43,072 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e02_1393731146548_0001_01_000009 transitioned from RUNNING to EXITED_WITH_SUCCESS
      2014-03-02 09:20:43,073 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e02_1393731146548_0001_01_000009
      2014-03-02 09:20:43,075 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: Using container runtime: DefaultLinuxContainerRuntime
      2014-03-02 09:20:43,081 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 9. Privileged Execution Operation Output:
      main : command provided 2
      main : run as user is yarn
      main : requested yarn user is yarn
      Full command array for failed execution:
      [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, yarn, yarn, 2, 9370, 15]
      2014-03-02 09:20:43,081 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Signal container failed. Exception:
      org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=9:
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: ExitCodeException exitCode=9:
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
              at org.apache.hadoop.util.Shell.run(Shell.java:838)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
              ... 9 more
      2014-03-02 09:20:43,113 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=yarn OPERATION=Container Finished - Succeeded        TARGET=ContainerImpl    RESULT=SUCCESS  APPID=application_1393731146548_0001    CONTAINERID=container_e02_1393731146548_0001_01_000009
      2014-03-02 09:20:43,115 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e02_1393731146548_0001_01_000009 transitioned from EXITED_WITH_SUCCESS to DONE
      2014-03-02 09:20:43,115 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_e02_1393731146548_0001_01_000009 from application application_1393731146548_0001
      
      

      Checked the same scenario in 2.7.2 version (not available)

      1. YARN-4744.001.patch
        6 kB
        Sidharta Seethana
      2. YARN-4744.002.patch
        16 kB
        Sidharta Seethana

        Activity

        Hide
        sidharta-s Sidharta Seethana added a comment -

        Hi Bibin A Chundatt,

        Could you provide some additional information here : Is security enabled ? Is this problem reproducible with included apps - e.g distributed shell ? Is it possible the container exited before the signal was delivered (exit code 9 is possible in this scenario) ?

        thanks,
        -Sidharta

        Show
        sidharta-s Sidharta Seethana added a comment - Hi Bibin A Chundatt , Could you provide some additional information here : Is security enabled ? Is this problem reproducible with included apps - e.g distributed shell ? Is it possible the container exited before the signal was delivered (exit code 9 is possible in this scenario) ? thanks, -Sidharta
        Hide
        bibinchundatt Bibin A Chundatt added a comment -

        Hi Sidharta Seethana
        Thank you for looking into the issue.

        1. Is security enabled ? Yes
        2. Is this problem reproducible ? Yes always , submit mapreduce job from cli
        2014-03-02 09:20:43,073 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e02_1393731146548_0001_01_000009
        

        Container cleanup is getting called for containers already EXITED_WITH_SUCCESS too.

        Update the description logs too.

        Show
        bibinchundatt Bibin A Chundatt added a comment - Hi Sidharta Seethana Thank you for looking into the issue. Is security enabled ? Yes Is this problem reproducible ? Yes always , submit mapreduce job from cli 2014-03-02 09:20:43,073 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e02_1393731146548_0001_01_000009 Container cleanup is getting called for containers already EXITED_WITH_SUCCESS too. Update the description logs too.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Thanks, I'll assign the issue to myself - I hope to be able to get to this soon.

        Show
        sidharta-s Sidharta Seethana added a comment - Thanks, I'll assign the issue to myself - I hope to be able to get to this soon.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        /cc Varun Vasudev

        It looks like this is an artifact of existing NM behavior - the NM appears to signal containers that have already exited ( as a part of ContainerLaunch.cleanupContainer() ) . This signal operation fails because the process has already exited. These failures were not logged before but they are being logged now because of the centralization of container-executor operations via PrivilegedOperationExecutor - which logs all container-executor failures.

        Show
        sidharta-s Sidharta Seethana added a comment - /cc Varun Vasudev It looks like this is an artifact of existing NM behavior - the NM appears to signal containers that have already exited ( as a part of ContainerLaunch.cleanupContainer() ) . This signal operation fails because the process has already exited. These failures were not logged before but they are being logged now because of the centralization of container-executor operations via PrivilegedOperationExecutor - which logs all container-executor failures.
        Hide
        bibinchundatt Bibin A Chundatt added a comment -

        Sidharta Seethana

        Can we use similar check like LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx).

        Show
        bibinchundatt Bibin A Chundatt added a comment - Sidharta Seethana Can we use similar check like LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx) .
        Hide
        vvasudev Varun Vasudev added a comment -

        Actually, there are two warn statements that are logged. One is in executePrivilegedOperation() in PrivilegedOperationExecutor and the second one is in signalContainer() in DefaultLinuxContainerRuntime.

        I'm unsure of how to handle this. My feeling is that the PrivilegedOperationExecutor should log failures irrespective of the error code but that the DefaultLinuxContainerRuntime shouldn't log the warning for invalid pids(similar to what LinuxContainerExecutor used to do before the refactoring).

        Jason Lowe, Vinod Kumar Vavilapalli, Rohith Sharma K S - what do you think?

        Show
        vvasudev Varun Vasudev added a comment - Actually, there are two warn statements that are logged. One is in executePrivilegedOperation() in PrivilegedOperationExecutor and the second one is in signalContainer() in DefaultLinuxContainerRuntime. I'm unsure of how to handle this. My feeling is that the PrivilegedOperationExecutor should log failures irrespective of the error code but that the DefaultLinuxContainerRuntime shouldn't log the warning for invalid pids(similar to what LinuxContainerExecutor used to do before the refactoring). Jason Lowe , Vinod Kumar Vavilapalli , Rohith Sharma K S - what do you think?
        Hide
        jlowe Jason Lowe added a comment -

        Can we use similar check like LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx).

        That function is implemented in terms of signalContainer (so we have the same issue), and the process could exit between the check and the subsequent kill attempt.

        My feeling is that the PrivilegedOperationExecutor should log failures irrespective of the error code

        There's always going to be a race where a container can exit before it gets killed, and I'm not sure we accomplish much besides alarming users when we log warnings when that occurs. IMHO PrivilegedOperationExecutor should not be the one that decides what should and shouldn't be logged, since it doesn't have any context on whether the error is severe enough to warrant it. Instead I think we should ensure the same data is present in the PrivilegedOperationException and let the code handling that error perform the logging if it is appropriate to do so.

        Show
        jlowe Jason Lowe added a comment - Can we use similar check like LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx). That function is implemented in terms of signalContainer (so we have the same issue), and the process could exit between the check and the subsequent kill attempt. My feeling is that the PrivilegedOperationExecutor should log failures irrespective of the error code There's always going to be a race where a container can exit before it gets killed, and I'm not sure we accomplish much besides alarming users when we log warnings when that occurs. IMHO PrivilegedOperationExecutor should not be the one that decides what should and shouldn't be logged, since it doesn't have any context on whether the error is severe enough to warrant it. Instead I think we should ensure the same data is present in the PrivilegedOperationException and let the code handling that error perform the logging if it is appropriate to do so.
        Hide
        vinodkv Vinod Kumar Vavilapalli added a comment -

        the NM appears to signal containers that have already exited ( as a part of ContainerLaunch.cleanupContainer() )

        This is by design. We did this originally so as to ensure cleaning up of any orphaned child-processes or process-groups - even if the root-process exits.

        Show
        vinodkv Vinod Kumar Vavilapalli added a comment - the NM appears to signal containers that have already exited ( as a part of ContainerLaunch.cleanupContainer() ) This is by design. We did this originally so as to ensure cleaning up of any orphaned child-processes or process-groups - even if the root-process exits.
        Hide
        bibinchundatt Bibin A Chundatt added a comment -

        Jason Lowe/Vinod Kumar Vavilapalli

        Confused with the exit code 9 also. From one of the documentation i read ,below are the exit code for container executor

        exit code      | NAME                                    | Description
        -----------------------------------------------------------------------------------------------
        8             |      UNABLE_TO_SIGNAL_CONTAINER  | The container-executor could not signal the container it was passed.
        9              |    INVALID_CONTAINER_PID                    | The PID passed to the container-executor was negative or 0.
        

        The exit code returned when container doesn't exist should have been 8 rt?
        We should recheck the exit code from container-executor and also based on exit code might be able to handle the errors too..

        Show
        bibinchundatt Bibin A Chundatt added a comment - Jason Lowe / Vinod Kumar Vavilapalli Confused with the exit code 9 also. From one of the documentation i read ,below are the exit code for container executor exit code | NAME | Description ----------------------------------------------------------------------------------------------- 8 | UNABLE_TO_SIGNAL_CONTAINER | The container-executor could not signal the container it was passed. 9 | INVALID_CONTAINER_PID | The PID passed to the container-executor was negative or 0. The exit code returned when container doesn't exist should have been 8 rt? We should recheck the exit code from container-executor and also based on exit code might be able to handle the errors too..
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Before PrivilegedOperationExecutor existed, there were several cases where not enough information was being logged about container-executor failures. Centralizing this provided useful information like invocation arguments, shell output etc - which has proved useful for debugging. In all cases except 'invalid pid', an error returned by container-executor is an error. IMO, we shouldn't remove the error logging.

        It looks like signalContainer in LinuxContainerExecutor ignores the exception for the 'invalid pid' case. We could do something this this :

        • Change DefaultContainerRuntime to ignore the 'invalid pid' error as well.
        • Change PrivilegedOperationExecutor / PrivilegedOperation to add the notion of 'ignore failures' for certain kinds of operations . Use this only for signalContainer and let the runtime/executor decide what they want to do.

        I'll submit a patch with these changes.

        Show
        sidharta-s Sidharta Seethana added a comment - Before PrivilegedOperationExecutor existed, there were several cases where not enough information was being logged about container-executor failures. Centralizing this provided useful information like invocation arguments, shell output etc - which has proved useful for debugging. In all cases except 'invalid pid', an error returned by container-executor is an error. IMO, we shouldn't remove the error logging. It looks like signalContainer in LinuxContainerExecutor ignores the exception for the 'invalid pid' case. We could do something this this : Change DefaultContainerRuntime to ignore the 'invalid pid' error as well. Change PrivilegedOperationExecutor / PrivilegedOperation to add the notion of 'ignore failures' for certain kinds of operations . Use this only for signalContainer and let the runtime/executor decide what they want to do. I'll submit a patch with these changes.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Bibin A Chundatt,

        Those two error codes are used differently. INVALID_CONTAINER_PID is used with errno ESRCH . UNABLE_TO_SIGNAL_CONTAINER is used in other cases. See code below :

        int signal_container_as_user(const char *user, int pid, int sig) {
          if(pid <= 0) {
            return INVALID_CONTAINER_PID;
          }
        
          if (change_user(user_detail->pw_uid, user_detail->pw_gid) != 0) {
            return SETUID_OPER_FAILED;
          }
        
          //Don't continue if the process-group is not alive anymore.
          int has_group = 1;
          if (kill(-pid,0) < 0) {
            if (kill(pid, 0) < 0) {
              if (errno == ESRCH) {
                return INVALID_CONTAINER_PID;
              }
              fprintf(LOGFILE, "Error signalling container %d with %d - %s\n",
        	      pid, sig, strerror(errno));
              return -1;
            } else {
              has_group = 0;
            }
          }
        
          if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
            if(errno != ESRCH) {
              fprintf(LOGFILE, 
                      "Error signalling process group %d with signal %d - %s\n", 
                      -pid, sig, strerror(errno));
              fprintf(stderr, 
                      "Error signalling process group %d with signal %d - %s\n", 
                      -pid, sig, strerror(errno));
              fflush(LOGFILE);
              return UNABLE_TO_SIGNAL_CONTAINER;
            } else {
              return INVALID_CONTAINER_PID;
            }
          }
          fprintf(LOGFILE, "Killing process %s%d with %d\n",
        	  (has_group ? "group " :""), pid, sig);
          return 0;
        }
        
        Show
        sidharta-s Sidharta Seethana added a comment - Bibin A Chundatt , Those two error codes are used differently. INVALID_CONTAINER_PID is used with errno ESRCH . UNABLE_TO_SIGNAL_CONTAINER is used in other cases. See code below : int signal_container_as_user( const char *user, int pid, int sig) { if (pid <= 0) { return INVALID_CONTAINER_PID; } if (change_user(user_detail->pw_uid, user_detail->pw_gid) != 0) { return SETUID_OPER_FAILED; } //Don't continue if the process-group is not alive anymore. int has_group = 1; if (kill(-pid,0) < 0) { if (kill(pid, 0) < 0) { if (errno == ESRCH) { return INVALID_CONTAINER_PID; } fprintf(LOGFILE, "Error signalling container %d with %d - %s\n" , pid, sig, strerror(errno)); return -1; } else { has_group = 0; } } if (kill((has_group ? -1 : 1) * pid, sig) < 0) { if (errno != ESRCH) { fprintf(LOGFILE, "Error signalling process group %d with signal %d - %s\n" , -pid, sig, strerror(errno)); fprintf(stderr, "Error signalling process group %d with signal %d - %s\n" , -pid, sig, strerror(errno)); fflush(LOGFILE); return UNABLE_TO_SIGNAL_CONTAINER; } else { return INVALID_CONTAINER_PID; } } fprintf(LOGFILE, "Killing process %s%d with %d\n" , (has_group ? "group " :""), pid, sig); return 0; }
        Hide
        jlowe Jason Lowe added a comment -

        As long as we're not logging a bunch of warningsf for benign events I'm good. I still think the log-then-throw idiom can be problematic in practice as it tends to lead to double-logging (both by the thrower and by the catcher). I understand the concern to miss logs, and it's safer to double-log than not log at all.

        Show
        jlowe Jason Lowe added a comment - As long as we're not logging a bunch of warningsf for benign events I'm good. I still think the log-then-throw idiom can be problematic in practice as it tends to lead to double-logging (both by the thrower and by the catcher). I understand the concern to miss logs, and it's safer to double-log than not log at all.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Jason Lowe That was my thinking as well - double logging is better than missed logs. In addition, logging in PrivilegedOperationExecutor includes information that isn't necessarily available when the exception is propagated.

        I'll upload a patch soon, thanks.

        Show
        sidharta-s Sidharta Seethana added a comment - Jason Lowe That was my thinking as well - double logging is better than missed logs. In addition, logging in PrivilegedOperationExecutor includes information that isn't necessarily available when the exception is propagated. I'll upload a patch soon, thanks.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Uploaded a patch that removes 'invalid pid' signal failure logging. Jason Lowe, could you please take a look?

        Show
        sidharta-s Sidharta Seethana added a comment - Uploaded a patch that removes 'invalid pid' signal failure logging. Jason Lowe , could you please take a look?
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 11s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 6m 42s trunk passed
        +1 compile 0m 22s trunk passed with JDK v1.8.0_72
        +1 compile 0m 27s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 16s trunk passed
        +1 mvnsite 0m 28s trunk passed
        +1 mvneclipse 0m 13s trunk passed
        +1 findbugs 0m 49s trunk passed
        +1 javadoc 0m 23s trunk passed with JDK v1.8.0_72
        +1 javadoc 0m 23s trunk passed with JDK v1.7.0_95
        +1 mvninstall 0m 25s the patch passed
        +1 compile 0m 30s the patch passed with JDK v1.8.0_72
        +1 javac 0m 30s the patch passed
        +1 compile 0m 24s the patch passed with JDK v1.7.0_95
        +1 javac 0m 24s the patch passed
        +1 checkstyle 0m 14s the patch passed
        +1 mvnsite 0m 26s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 0m 59s the patch passed
        +1 javadoc 0m 16s the patch passed with JDK v1.8.0_72
        +1 javadoc 0m 20s the patch passed with JDK v1.7.0_95
        +1 unit 9m 30s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72.
        +1 unit 9m 47s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95.
        +1 asflicense 0m 19s Patch does not generate ASF License warnings.
        34m 29s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:0ca8df7
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12791050/YARN-4744.001.patch
        JIRA Issue YARN-4744
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux bfc907ba6af7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 67880cc
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10691/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/10691/console
        Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 11s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 6m 42s trunk passed +1 compile 0m 22s trunk passed with JDK v1.8.0_72 +1 compile 0m 27s trunk passed with JDK v1.7.0_95 +1 checkstyle 0m 16s trunk passed +1 mvnsite 0m 28s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 0m 49s trunk passed +1 javadoc 0m 23s trunk passed with JDK v1.8.0_72 +1 javadoc 0m 23s trunk passed with JDK v1.7.0_95 +1 mvninstall 0m 25s the patch passed +1 compile 0m 30s the patch passed with JDK v1.8.0_72 +1 javac 0m 30s the patch passed +1 compile 0m 24s the patch passed with JDK v1.7.0_95 +1 javac 0m 24s the patch passed +1 checkstyle 0m 14s the patch passed +1 mvnsite 0m 26s the patch passed +1 mvneclipse 0m 10s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 0m 59s the patch passed +1 javadoc 0m 16s the patch passed with JDK v1.8.0_72 +1 javadoc 0m 20s the patch passed with JDK v1.7.0_95 +1 unit 9m 30s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. +1 unit 9m 47s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. +1 asflicense 0m 19s Patch does not generate ASF License warnings. 34m 29s Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12791050/YARN-4744.001.patch JIRA Issue YARN-4744 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux bfc907ba6af7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 67880cc Default Java 1.7.0_95 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 findbugs v3.0.0 JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10691/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/10691/console Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Testing note : This patch makes minor logging changes - I tested the patch manually using distributed shell.

        Show
        sidharta-s Sidharta Seethana added a comment - Testing note : This patch makes minor logging changes - I tested the patch manually using distributed shell.
        Hide
        jlowe Jason Lowe added a comment -

        Thanks for the patch!

        In addition, logging in PrivilegedOperationExecutor includes information that isn't necessarily available when the exception is propagated.

        That problem is solved by having the throwing code either encode that information in the exception message or adding necessary fields to the exception class, allowing the error handler to retrieve them as needed. If the throwing code can create an appropriate log message then it can put that same information in the exception. There's already a custom exception for these errors, so it would be easy to add things like full command line, etc. I still think the code handling the error is the real problem if we're missing appropriate logs, but I don't feel so strongly to block it if others prefer leaving the log-then-throw logic in place.

        Comments on the patch:

        Don't we need to update DockerLinuxContainerRuntime in a similar manner? I think we'll have the same issue there.

        PrivilegedOperation should have a constructor that just takes an opType parameter and the other constructors should be implemented in terms of it. That eliminates the duplicate code maintenance pitfalls and avoids doing odd things like passing nulls as standard practice.

        Show
        jlowe Jason Lowe added a comment - Thanks for the patch! In addition, logging in PrivilegedOperationExecutor includes information that isn't necessarily available when the exception is propagated. That problem is solved by having the throwing code either encode that information in the exception message or adding necessary fields to the exception class, allowing the error handler to retrieve them as needed. If the throwing code can create an appropriate log message then it can put that same information in the exception. There's already a custom exception for these errors, so it would be easy to add things like full command line, etc. I still think the code handling the error is the real problem if we're missing appropriate logs, but I don't feel so strongly to block it if others prefer leaving the log-then-throw logic in place. Comments on the patch: Don't we need to update DockerLinuxContainerRuntime in a similar manner? I think we'll have the same issue there. PrivilegedOperation should have a constructor that just takes an opType parameter and the other constructors should be implemented in terms of it. That eliminates the duplicate code maintenance pitfalls and avoids doing odd things like passing nulls as standard practice.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Thanks, Jason Lowe. I would prefer to leave the log-then-throw in place right now (or at least keep it outside the scope of this JIRA ).

        About the patch : I didn't modify DockerLinuxContainerRuntime because the signaling there needs additional work - needs to be reimplemented in terms of docker operations. Not sure if I filed a JIRA for that yet, I'll check. I'll fix the PrivilegedOperation constructor.

        Show
        sidharta-s Sidharta Seethana added a comment - Thanks, Jason Lowe . I would prefer to leave the log-then-throw in place right now (or at least keep it outside the scope of this JIRA ). About the patch : I didn't modify DockerLinuxContainerRuntime because the signaling there needs additional work - needs to be reimplemented in terms of docker operations. Not sure if I filed a JIRA for that yet, I'll check. I'll fix the PrivilegedOperation constructor.
        Hide
        jlowe Jason Lowe added a comment -

        Even if the Docker stuff doesn't work totally, it has the same logic and will have the same issue at a high level (i.e.: will always be a race between kill and container exiting on its own) – so why wouldn't we want to make the change at least for doc purposes for those coming along later to fix it?

        Show
        jlowe Jason Lowe added a comment - Even if the Docker stuff doesn't work totally, it has the same logic and will have the same issue at a high level (i.e.: will always be a race between kill and container exiting on its own) – so why wouldn't we want to make the change at least for doc purposes for those coming along later to fix it?
        Hide
        jlowe Jason Lowe added a comment -

        Ah, ignore my previous comment – I see now that we don't have the docker tools in place to know whether or not the kill failed in that way.

        Show
        jlowe Jason Lowe added a comment - Ah, ignore my previous comment – I see now that we don't have the docker tools in place to know whether or not the kill failed in that way.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Uploaded a new patch - added a new PrivilegedOperation constructor and fixed all instances that had a null second argument to use this new constructor.

        Also filed YARN-4759 to revisit signal handling for docker containers.

        Jason Lowe, please take a look? Thanks!

        Show
        sidharta-s Sidharta Seethana added a comment - Uploaded a new patch - added a new PrivilegedOperation constructor and fixed all instances that had a null second argument to use this new constructor. Also filed YARN-4759 to revisit signal handling for docker containers. Jason Lowe , please take a look? Thanks!
        Hide
        jlowe Jason Lowe added a comment -

        Thanks for updating the patch!

        +1, pending Jenkins.

        Show
        jlowe Jason Lowe added a comment - Thanks for updating the patch! +1, pending Jenkins.
        Hide
        hadoopqa Hadoop QA added a comment -
        +1 overall



        Vote Subsystem Runtime Comment
        0 reexec 12m 54s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
        +1 mvninstall 7m 7s trunk passed
        +1 compile 0m 24s trunk passed with JDK v1.8.0_74
        +1 compile 0m 28s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 15s trunk passed
        +1 mvnsite 0m 30s trunk passed
        +1 mvneclipse 0m 13s trunk passed
        +1 findbugs 0m 55s trunk passed
        +1 javadoc 0m 19s trunk passed with JDK v1.8.0_74
        +1 javadoc 0m 22s trunk passed with JDK v1.7.0_95
        +1 mvninstall 0m 26s the patch passed
        +1 compile 0m 22s the patch passed with JDK v1.8.0_74
        +1 javac 0m 22s the patch passed
        +1 compile 0m 24s the patch passed with JDK v1.7.0_95
        +1 javac 0m 24s the patch passed
        +1 checkstyle 0m 14s the patch passed
        +1 mvnsite 0m 28s the patch passed
        +1 mvneclipse 0m 12s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 1m 6s the patch passed
        +1 javadoc 0m 18s the patch passed with JDK v1.8.0_74
        +1 javadoc 0m 20s the patch passed with JDK v1.7.0_95
        +1 unit 9m 23s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74.
        +1 unit 9m 48s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95.
        +1 asflicense 0m 18s Patch does not generate ASF License warnings.
        47m 47s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:0ca8df7
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12791259/YARN-4744.002.patch
        JIRA Issue YARN-4744
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 526b1198d34e 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 0a9f00a
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10704/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/10704/console
        Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 12m 54s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 7m 7s trunk passed +1 compile 0m 24s trunk passed with JDK v1.8.0_74 +1 compile 0m 28s trunk passed with JDK v1.7.0_95 +1 checkstyle 0m 15s trunk passed +1 mvnsite 0m 30s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 0m 55s trunk passed +1 javadoc 0m 19s trunk passed with JDK v1.8.0_74 +1 javadoc 0m 22s trunk passed with JDK v1.7.0_95 +1 mvninstall 0m 26s the patch passed +1 compile 0m 22s the patch passed with JDK v1.8.0_74 +1 javac 0m 22s the patch passed +1 compile 0m 24s the patch passed with JDK v1.7.0_95 +1 javac 0m 24s the patch passed +1 checkstyle 0m 14s the patch passed +1 mvnsite 0m 28s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 18s the patch passed with JDK v1.8.0_74 +1 javadoc 0m 20s the patch passed with JDK v1.7.0_95 +1 unit 9m 23s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74. +1 unit 9m 48s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. +1 asflicense 0m 18s Patch does not generate ASF License warnings. 47m 47s Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12791259/YARN-4744.002.patch JIRA Issue YARN-4744 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 526b1198d34e 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 0a9f00a Default Java 1.7.0_95 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 findbugs v3.0.0 JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10704/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/10704/console Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        sidharta-s Sidharta Seethana added a comment -

        Thanks, Jason Lowe !

        Show
        sidharta-s Sidharta Seethana added a comment - Thanks, Jason Lowe !
        Hide
        sidharta-s Sidharta Seethana added a comment -

        hi Jason Lowe, could you please commit this patch when possible? Thanks!

        Show
        sidharta-s Sidharta Seethana added a comment - hi Jason Lowe , could you please commit this patch when possible? Thanks!
        Hide
        jlowe Jason Lowe added a comment -

        Sorry for the delay, as I was out the last few days.

        Thanks to Sidharta for the contribution and to Bibin and Varun for additional review! I committed this to trunk, branch-2, and branch-2.8.

        Show
        jlowe Jason Lowe added a comment - Sorry for the delay, as I was out the last few days. Thanks to Sidharta for the contribution and to Bibin and Varun for additional review! I committed this to trunk, branch-2, and branch-2.8.
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9434 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9434/)
        YARN-4744. Too many signal to container failure in case of LCE. (jlowe: rev 059caf99891943d9587cac19b48e82efbed06b2d)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/TestPrivilegedOperationExecutor.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/PrivilegedOperation.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TrafficController.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsHandlerImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/PrivilegedOperationExecutor.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DefaultLinuxContainerRuntime.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TestCGroupsHandlerImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #9434 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9434/ ) YARN-4744 . Too many signal to container failure in case of LCE. (jlowe: rev 059caf99891943d9587cac19b48e82efbed06b2d) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/TestPrivilegedOperationExecutor.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/PrivilegedOperation.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TrafficController.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsHandlerImpl.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/PrivilegedOperationExecutor.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DefaultLinuxContainerRuntime.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TestCGroupsHandlerImpl.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java

          People

          • Assignee:
            sidharta-s Sidharta Seethana
            Reporter:
            bibinchundatt Bibin A Chundatt
          • Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development