Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-4459

container-executor should only kill process groups

    Details

    • Hadoop Flags:
      Reviewed

      Description

      When calling 'signal_container_as_user' in container-executor, it first checks whether process group exists, if not, it will kill the process itself(if it the process exists). It is not reasonable because that the process group does not exist means corresponding container has finished, if we kill the process itself, we just kill wrong process.

      We found it happened in our cluster many times. We used same account for starting NM and submitted app, and container-executor sometimes killed NM(the wrongly killed process might just be a newly started thread and was NM's child process).

      1. YARN-4459.01.patch
        2 kB
        Jun Gong
      2. YARN-4459.02.patch
        2 kB
        Jun Gong
      3. YARN-4459.03.patch
        4 kB
        Jason Lowe

        Issue Links

          Activity

          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 8m 18s trunk passed
          +1 compile 0m 34s trunk passed with JDK v1.8.0_66
          +1 compile 0m 32s trunk passed with JDK v1.7.0_91
          +1 mvnsite 0m 34s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 mvninstall 0m 30s the patch passed
          +1 compile 0m 27s the patch passed with JDK v1.8.0_66
          +1 cc 0m 27s the patch passed
          +1 javac 0m 27s the patch passed
          +1 compile 0m 31s the patch passed with JDK v1.7.0_91
          +1 cc 0m 31s the patch passed
          +1 javac 0m 31s the patch passed
          +1 mvnsite 0m 33s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          -1 whitespace 0m 0s The patch has 1 line(s) with tabs.
          +1 unit 9m 58s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66.
          +1 unit 9m 18s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91.
          +1 asflicense 0m 24s Patch does not generate ASF License warnings.
          32m 22s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12777742/YARN-4459.01.patch
          JIRA Issue YARN-4459
          Optional Tests asflicense compile cc mvnsite javac unit
          uname Linux d353811253b1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0c3a53e
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9981/artifact/patchprocess/whitespace-tabs.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9981/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 76MB
          Powered by Apache Yetus 0.1.0 http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9981/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 0s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 8m 18s trunk passed +1 compile 0m 34s trunk passed with JDK v1.8.0_66 +1 compile 0m 32s trunk passed with JDK v1.7.0_91 +1 mvnsite 0m 34s trunk passed +1 mvneclipse 0m 15s trunk passed +1 mvninstall 0m 30s the patch passed +1 compile 0m 27s the patch passed with JDK v1.8.0_66 +1 cc 0m 27s the patch passed +1 javac 0m 27s the patch passed +1 compile 0m 31s the patch passed with JDK v1.7.0_91 +1 cc 0m 31s the patch passed +1 javac 0m 31s the patch passed +1 mvnsite 0m 33s the patch passed +1 mvneclipse 0m 14s the patch passed -1 whitespace 0m 0s The patch has 1 line(s) with tabs. +1 unit 9m 58s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. +1 unit 9m 18s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. +1 asflicense 0m 24s Patch does not generate ASF License warnings. 32m 22s Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12777742/YARN-4459.01.patch JIRA Issue YARN-4459 Optional Tests asflicense compile cc mvnsite javac unit uname Linux d353811253b1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 0c3a53e whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9981/artifact/patchprocess/whitespace-tabs.txt JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9981/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Max memory used 76MB Powered by Apache Yetus 0.1.0 http://yetus.apache.org Console output https://builds.apache.org/job/PreCommit-YARN-Build/9981/console This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Update a new patch to fix whitespace error.

          Show
          hex108 Jun Gong added a comment - Update a new patch to fix whitespace error.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 8m 16s trunk passed
          +1 compile 0m 31s trunk passed with JDK v1.8.0_66
          +1 compile 0m 30s trunk passed with JDK v1.7.0_91
          +1 mvnsite 0m 31s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 mvninstall 0m 28s the patch passed
          +1 compile 0m 26s the patch passed with JDK v1.8.0_66
          +1 cc 0m 26s the patch passed
          +1 javac 0m 26s the patch passed
          +1 compile 0m 30s the patch passed with JDK v1.7.0_91
          +1 cc 0m 30s the patch passed
          +1 javac 0m 30s the patch passed
          +1 mvnsite 0m 30s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          -1 unit 8m 45s hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66.
          +1 unit 9m 13s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91.
          +1 asflicense 0m 24s Patch does not generate ASF License warnings.
          30m 43s



          Reason Tests
          JDK v1.8.0_66 Failed junit tests hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12777754/YARN-4459.02.patch
          JIRA Issue YARN-4459
          Optional Tests asflicense compile cc mvnsite javac unit
          uname Linux 72717a2b4b82 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0c3a53e
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9983/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 76MB
          Powered by Apache Yetus 0.1.0 http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9983/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 0s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 8m 16s trunk passed +1 compile 0m 31s trunk passed with JDK v1.8.0_66 +1 compile 0m 30s trunk passed with JDK v1.7.0_91 +1 mvnsite 0m 31s trunk passed +1 mvneclipse 0m 13s trunk passed +1 mvninstall 0m 28s the patch passed +1 compile 0m 26s the patch passed with JDK v1.8.0_66 +1 cc 0m 26s the patch passed +1 javac 0m 26s the patch passed +1 compile 0m 30s the patch passed with JDK v1.7.0_91 +1 cc 0m 30s the patch passed +1 javac 0m 30s the patch passed +1 mvnsite 0m 30s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. -1 unit 8m 45s hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. +1 unit 9m 13s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. +1 asflicense 0m 24s Patch does not generate ASF License warnings. 30m 43s Reason Tests JDK v1.8.0_66 Failed junit tests hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12777754/YARN-4459.02.patch JIRA Issue YARN-4459 Optional Tests asflicense compile cc mvnsite javac unit uname Linux 72717a2b4b82 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 0c3a53e unit https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9983/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Max memory used 76MB Powered by Apache Yetus 0.1.0 http://yetus.apache.org Console output https://builds.apache.org/job/PreCommit-YARN-Build/9983/console This message was automatically generated.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Is this issue similar to YARN-3678 ?

          Show
          Naganarasimha Naganarasimha G R added a comment - Is this issue similar to YARN-3678 ?
          Hide
          hex108 Jun Gong added a comment -

          Thanks Naganarasimha G R for the info. Yes, it seems same problem. I think the problem does not only exist for 'DelayedProcessKiller', it might occur for every call to 'signal_container_as_user'.

          Show
          hex108 Jun Gong added a comment - Thanks Naganarasimha G R for the info. Yes, it seems same problem. I think the problem does not only exist for 'DelayedProcessKiller', it might occur for every call to 'signal_container_as_user'.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Hi Jun Gong,
          thanks for working on this jira. I am not from c back ground, neverthless checked the API of kill and few doubts i have here
          IIUC existing code checks whether container process has created any sub process then kill all the process, else if its a single process then i presume kill(-pid,0) will return -1 then it tries to kill only the container process id only. Can you confirm this by testing?
          I just tested this with unix command kill what i could understand was kill -0 – -<pid which has children> will be successfull and $? will return 0 but when i run kill -0 – -<pid which has NO children> then bash: kill: (-10967) - No such process will thrown.
          Correct me if my understanding is wrong.
          cc/ @Varun Vasudev.

          Show
          Naganarasimha Naganarasimha G R added a comment - Hi Jun Gong , thanks for working on this jira. I am not from c back ground, neverthless checked the API of kill and few doubts i have here IIUC existing code checks whether container process has created any sub process then kill all the process, else if its a single process then i presume kill(-pid,0) will return -1 then it tries to kill only the container process id only. Can you confirm this by testing? I just tested this with unix command kill what i could understand was kill -0 – -<pid which has children> will be successfull and $? will return 0 but when i run kill -0 – -<pid which has NO children> then bash: kill: (-10967) - No such process will thrown. Correct me if my understanding is wrong. cc/ @ Varun Vasudev .
          Hide
          hex108 Jun Gong added a comment -

          Thanks Naganarasimha G R for the review, very appreciate it!

          IIUC existing code checks whether container process has created any sub process then kill all the process, else if its a single process then i presume kill(-pid,0) will return -1 then it tries to kill only the container process id only. Can you confirm this by testing?

          I do not have a good method to add a test case for it now. I will try to explain it by following test cases:
          1. If container's parent process does not exist and its child process does exist, existing code works well.

          root@bd74d89a2294:/$ cat test.sh 
          sleep 6666 &
          PID_FOR_SLEEP=$!
          PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
          echo "PID for 'sleep 6666' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
          root@bd74d89a2294:/$ ./test.sh 
          PID for 'sleep 6666' : 26877, its pgid : 26876
          root@bd74d89a2294:/$ ps -ef | grep sleep
          root     26877     1  0 08:48 pts/0    00:00:00 sleep 6666
          root     26880 26797  0 08:48 pts/0    00:00:00 grep sleep
          root@bd74d89a2294:/$ kill -0 -26876
          root@bd74d89a2294:/$ kill -15 -26876
          root@bd74d89a2294:/$ ps -ef | grep sleep
          root     26882 26797  0 08:48 pts/0    00:00:00 grep sleep
          

          2. If container's parent process does not exist and its child process does not exist either, existing code will kill process wrongly.

          root@bd74d89a2294:/$ cat test.sh 
          sleep 2 &
          PID_FOR_SLEEP=$!
          PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
          echo "PID for 'sleep 2' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
          root@bd74d89a2294:/$ ./test.sh 
          PID for 'sleep 2' : 26890, its pgid : 26889
          root@bd74d89a2294:/$ ps -ef | grep sleep
          root     26893 26797  0 08:56 pts/0    00:00:00 grep sleep
          root@bd74d89a2294:/$ kill -0 26889
          -bash: kill: (26889) - No such process
          

          Then we check existing code in container-executor.c for the above case:

            if (kill(-pid,0) < 0) {
              if (kill(pid, 0) < 0) {
                if (errno == ESRCH) {
                  return INVALID_CONTAINER_PID;
                }
                fprintf(LOGFILE, "Error signalling container %d with %d - %s\n",
          	      pid, sig, strerror(errno));
                return -1;
              } else {
                has_group = 0;
              }
            }
          
            if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
          

          kill(-pid,0) will return -1. If pid is reused for a new process(suppose that the container has survived for a long time, pid recycle occurs, then this pid might be reused again), kill(pid, 0) will return 0, then has_group will be set to 0, and the code kill((has_group ? -1 : 1) * pid, sig) will try to kill pid with sig. This is the problem.

          Show
          hex108 Jun Gong added a comment - Thanks Naganarasimha G R for the review, very appreciate it! IIUC existing code checks whether container process has created any sub process then kill all the process, else if its a single process then i presume kill(-pid,0) will return -1 then it tries to kill only the container process id only. Can you confirm this by testing? I do not have a good method to add a test case for it now. I will try to explain it by following test cases: 1. If container's parent process does not exist and its child process does exist, existing code works well. root@bd74d89a2294:/$ cat test.sh sleep 6666 & PID_FOR_SLEEP=$! PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=) echo "PID for 'sleep 6666' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP" root@bd74d89a2294:/$ ./test.sh PID for 'sleep 6666' : 26877, its pgid : 26876 root@bd74d89a2294:/$ ps -ef | grep sleep root 26877 1 0 08:48 pts/0 00:00:00 sleep 6666 root 26880 26797 0 08:48 pts/0 00:00:00 grep sleep root@bd74d89a2294:/$ kill -0 -26876 root@bd74d89a2294:/$ kill -15 -26876 root@bd74d89a2294:/$ ps -ef | grep sleep root 26882 26797 0 08:48 pts/0 00:00:00 grep sleep 2. If container's parent process does not exist and its child process does not exist either, existing code will kill process wrongly. root@bd74d89a2294:/$ cat test.sh sleep 2 & PID_FOR_SLEEP=$! PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=) echo "PID for 'sleep 2' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP" root@bd74d89a2294:/$ ./test.sh PID for 'sleep 2' : 26890, its pgid : 26889 root@bd74d89a2294:/$ ps -ef | grep sleep root 26893 26797 0 08:56 pts/0 00:00:00 grep sleep root@bd74d89a2294:/$ kill -0 26889 -bash: kill: (26889) - No such process Then we check existing code in container-executor.c for the above case: if (kill(-pid,0) < 0) { if (kill(pid, 0) < 0) { if (errno == ESRCH) { return INVALID_CONTAINER_PID; } fprintf(LOGFILE, "Error signalling container %d with %d - %s\n" , pid, sig, strerror(errno)); return -1; } else { has_group = 0; } } if (kill((has_group ? -1 : 1) * pid, sig) < 0) { kill(-pid,0) will return -1. If pid is reused for a new process(suppose that the container has survived for a long time, pid recycle occurs, then this pid might be reused again), kill(pid, 0) will return 0, then has_group will be set to 0, and the code kill((has_group ? -1 : 1) * pid, sig) will try to kill pid with sig . This is the problem.
          Hide
          vvasudev Varun Vasudev added a comment - - edited

          Jun Gong - assume that the process with the recycled pid does a setsid call - then the process group check will succeed and we might still end up killing the wrong process, no?

          The assumptions of the patch are
          1) The new process will not belong to the same user and
          2) The new process has not called setsid

          Correct?

          I suspect we might need to add a timing check similar to the one proposed in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560578

          Show
          vvasudev Varun Vasudev added a comment - - edited Jun Gong - assume that the process with the recycled pid does a setsid call - then the process group check will succeed and we might still end up killing the wrong process, no? The assumptions of the patch are 1) The new process will not belong to the same user and 2) The new process has not called setsid Correct? I suspect we might need to add a timing check similar to the one proposed in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560578
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          +1 for timing check solution than this approrach !

          Show
          Naganarasimha Naganarasimha G R added a comment - +1 for timing check solution than this approrach !
          Hide
          hex108 Jun Gong added a comment -

          Thanks Varun Vasudev the review and suggestion.

          assume that the process with the recycled pid does a setsid call - then the process group check will succeed and we might still end up killing the wrong process, no?

          Yes, it might kill a wrong process, I have not found a perfect method to avoid this. At least it will not kill NM.

          The assumptions of the patch are
          1) The new process will not belong to the same user and
          2) The new process has not called setsid

          yeah, the and is or actually. It will kill a wrong process when new process belongs to the same user and has been called setsid. The rate is lower.

          I suspect we might need to add a timing check similar to the one proposed in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560578

          Adding this check will reduce the rate that kills a wrong process. Will add it later. It could not avoid killing a wrong process either as Hong Zhiguo said in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560748&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560748.

          Show
          hex108 Jun Gong added a comment - Thanks Varun Vasudev the review and suggestion. assume that the process with the recycled pid does a setsid call - then the process group check will succeed and we might still end up killing the wrong process, no? Yes, it might kill a wrong process, I have not found a perfect method to avoid this. At least it will not kill NM. The assumptions of the patch are 1) The new process will not belong to the same user and 2) The new process has not called setsid yeah, the and is or actually. It will kill a wrong process when new process belongs to the same user and has been called setsid. The rate is lower. I suspect we might need to add a timing check similar to the one proposed in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560578 Adding this check will reduce the rate that kills a wrong process. Will add it later. It could not avoid killing a wrong process either as Hong Zhiguo said in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560748&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560748 .
          Hide
          jlowe Jason Lowe added a comment -

          Sorry to arrive to this late. I agree that we should be killing the session and not the pid. It's not a perfect solution, but it drastically reduces the likelihood of the wrong process getting killed. This could be improved upon by adding a just-before-kill check of some sort and/or proactive cancelling of the timer when we see the child process exit before the SIGKILL is sent. However rather than holding up this significant improvement waiting for those things to be added, I propose we add this now and further iterate on it in a subsequent JIRA.

          +1 for the patch. Will commit this in a couple of days if there are no objections.

          Show
          jlowe Jason Lowe added a comment - Sorry to arrive to this late. I agree that we should be killing the session and not the pid. It's not a perfect solution, but it drastically reduces the likelihood of the wrong process getting killed. This could be improved upon by adding a just-before-kill check of some sort and/or proactive cancelling of the timer when we see the child process exit before the SIGKILL is sent. However rather than holding up this significant improvement waiting for those things to be added, I propose we add this now and further iterate on it in a subsequent JIRA. +1 for the patch. Will commit this in a couple of days if there are no objections.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 6m 37s trunk passed
          +1 compile 0m 24s trunk passed
          +1 mvnsite 0m 26s trunk passed
          +1 mvneclipse 0m 10s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 30s the patch passed
          +1 cc 0m 30s the patch passed
          +1 javac 0m 30s the patch passed
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          -1 unit 11m 49s hadoop-yarn-server-nodemanager in the patch failed.
          +1 asflicense 0m 16s Patch does not generate ASF License warnings.
          21m 59s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:2c91fd8
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12777754/YARN-4459.02.patch
          JIRA Issue YARN-4459
          Optional Tests asflicense compile cc mvnsite javac unit
          uname Linux 522b1ceab420 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / ac95448
          Default Java 1.8.0_91
          unit https://builds.apache.org/job/PreCommit-YARN-Build/11634/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11634/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/11634/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 13s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 6m 37s trunk passed +1 compile 0m 24s trunk passed +1 mvnsite 0m 26s trunk passed +1 mvneclipse 0m 10s trunk passed +1 mvninstall 0m 32s the patch passed +1 compile 0m 30s the patch passed +1 cc 0m 30s the patch passed +1 javac 0m 30s the patch passed +1 mvnsite 0m 35s the patch passed +1 mvneclipse 0m 11s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. -1 unit 11m 49s hadoop-yarn-server-nodemanager in the patch failed. +1 asflicense 0m 16s Patch does not generate ASF License warnings. 21m 59s Subsystem Report/Notes Docker Image:yetus/hadoop:2c91fd8 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12777754/YARN-4459.02.patch JIRA Issue YARN-4459 Optional Tests asflicense compile cc mvnsite javac unit uname Linux 522b1ceab420 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / ac95448 Default Java 1.8.0_91 unit https://builds.apache.org/job/PreCommit-YARN-Build/11634/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11634/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/11634/console Powered by Apache Yetus 0.2.0 http://yetus.apache.org This message was automatically generated.
          Hide
          jlowe Jason Lowe added a comment -

          Updated the patch to fix the unit test failure. Now that we're only killing by process group there's no need for the test that signals individual pids directly.

          Show
          jlowe Jason Lowe added a comment - Updated the patch to fix the unit test failure. Now that we're only killing by process group there's no need for the test that signals individual pids directly.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 10s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 25s trunk passed
          +1 compile 0m 24s trunk passed
          +1 mvnsite 0m 26s trunk passed
          +1 mvneclipse 0m 11s trunk passed
          +1 mvninstall 0m 21s the patch passed
          +1 compile 0m 22s the patch passed
          +1 cc 0m 22s the patch passed
          +1 javac 0m 22s the patch passed
          +1 mvnsite 0m 30s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 unit 11m 28s hadoop-yarn-server-nodemanager in the patch passed.
          +1 asflicense 0m 18s Patch does not generate ASF License warnings.
          20m 59s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:2c91fd8
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12805718/YARN-4459.03.patch
          JIRA Issue YARN-4459
          Optional Tests asflicense compile cc mvnsite javac unit
          uname Linux 094e9f4f0d09 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / ac95448
          Default Java 1.8.0_91
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11637/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/11637/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 10s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 25s trunk passed +1 compile 0m 24s trunk passed +1 mvnsite 0m 26s trunk passed +1 mvneclipse 0m 11s trunk passed +1 mvninstall 0m 21s the patch passed +1 compile 0m 22s the patch passed +1 cc 0m 22s the patch passed +1 javac 0m 22s the patch passed +1 mvnsite 0m 30s the patch passed +1 mvneclipse 0m 11s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 unit 11m 28s hadoop-yarn-server-nodemanager in the patch passed. +1 asflicense 0m 18s Patch does not generate ASF License warnings. 20m 59s Subsystem Report/Notes Docker Image:yetus/hadoop:2c91fd8 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12805718/YARN-4459.03.patch JIRA Issue YARN-4459 Optional Tests asflicense compile cc mvnsite javac unit uname Linux 094e9f4f0d09 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / ac95448 Default Java 1.8.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11637/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/11637/console Powered by Apache Yetus 0.2.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Jason Lowe for review and updating the patch!

          This could be improved upon by adding a just-before-kill check of some sort and/or proactive cancelling of the timer when we see the child process exit before the SIGKILL is sent.

          Sorry for not adding it for a long time. I will add it in following jira if it is OK.

          Show
          hex108 Jun Gong added a comment - Thanks Jason Lowe for review and updating the patch! This could be improved upon by adding a just-before-kill check of some sort and/or proactive cancelling of the timer when we see the child process exit before the SIGKILL is sent. Sorry for not adding it for a long time. I will add it in following jira if it is OK.
          Hide
          eepayne Eric Payne added a comment -

          Thanks Jun Gong for the fix and Jason Lowe for the update and reveiw.
          Patch LGTM.
          +1

          Show
          eepayne Eric Payne added a comment - Thanks Jun Gong for the fix and Jason Lowe for the update and reveiw. Patch LGTM. +1
          Hide
          jlowe Jason Lowe added a comment -

          Committing this.

          Show
          jlowe Jason Lowe added a comment - Committing this.
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #9861 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9861/)
          YARN-4459. container-executor should only kill process groups. (jlowe: rev 1ba31fe9e906dbd093afd4b254216601967a4a7b)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #9861 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9861/ ) YARN-4459 . container-executor should only kill process groups. (jlowe: rev 1ba31fe9e906dbd093afd4b254216601967a4a7b) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          Hide
          jlowe Jason Lowe added a comment -

          Thanks to Jun Gong for the contribution and to Naganarasimha G R, Varun Vasudev, and Eric Payne for additional review! I committed this to trunk, branch-2, branch-2.8, branch-2.7, and branch-2.6.

          YARN-5154 tracks the followup work to further reduce the chance of killing the wrong process when pids are recycled.

          Show
          jlowe Jason Lowe added a comment - Thanks to Jun Gong for the contribution and to Naganarasimha G R , Varun Vasudev , and Eric Payne for additional review! I committed this to trunk, branch-2, branch-2.8, branch-2.7, and branch-2.6. YARN-5154 tracks the followup work to further reduce the chance of killing the wrong process when pids are recycled.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Jason Lowe for the patch, review and commit! Thanks Naganarasimha G R, Varun Vasudev and Eric Payne for the review and comments!

          Show
          hex108 Jun Gong added a comment - Thanks Jason Lowe for the patch, review and commit! Thanks Naganarasimha G R , Varun Vasudev and Eric Payne for the review and comments!
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing the JIRA as part of 2.7.3 release.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - Closing the JIRA as part of 2.7.3 release.

            People

            • Assignee:
              hex108 Jun Gong
              Reporter:
              hex108 Jun Gong
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development