Hadoop YARN / YARN-3678

DelayedProcessKiller may kill a process other than the container

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.6.0, 2.7.2
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels:
      None

      Description

      Suppose a container has finished and cleanup begins. The PID file still exists and triggers one more signalContainer call, which kills the process with the pid recorded in the PID file. But since the container has already finished, that pid may by then be occupied by another process, which can cause serious issues.
      In my case the NM was killed unexpectedly, and what I described above can be the cause, even though it occurs rarely.
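
      A minimal sketch of the pattern being described, purely for illustration (this is not the actual NodeManager/container-executor code; the pid-file path and class name are invented):

      {code:java}
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      // Simplified illustration of the race described above -- NOT the real NodeManager code.
      public class PidFileKillSketch {
        public static void main(String[] args) throws Exception {
          // Hypothetical pid file left behind by a container that has already exited.
          String pid = new String(
              Files.readAllBytes(Paths.get("/tmp/container_1234_0001_01_000002.pid")),
              StandardCharsets.UTF_8).trim();

          // SIGTERM, short grace period, then SIGKILL: the pattern discussed in this issue.
          new ProcessBuilder("kill", "-15", pid).inheritIO().start().waitFor();
          Thread.sleep(250);
          // Nothing verifies that 'pid' still belongs to the finished container.  If the kernel
          // has recycled the pid in the meantime, this SIGKILL hits an unrelated process
          // (possibly an NM thread, since Linux pids and tids share one number space).
          new ProcessBuilder("kill", "-9", pid).inheritIO().start().waitFor();
        }
      }
      {code}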


          Activity

          kirklg Kirk Leon Guerrero added a comment -

          I added 2.7.2 to the affected versions. My NM receives SIGKILL often enough that it's a problem, and most often SIGTERM. (non-secure mode)

          gu chi gu-chi added a comment -

          Same issue, as confirmed with Jun Gong.

          xuchenCN Xu Chen added a comment -

          LGTM +1

          gu chi gu-chi added a comment -

          I made this: https://github.com/apache/hadoop/pull/20/

          hadoopqa Hadoop QA added a comment -



          -1 overall

          Vote | Subsystem | Runtime | Comment
          -1   | patch     | 0m 0s   | The patch command could not apply the patch during dryrun.

          Subsystem      | Report/Notes
          Patch URL      | http://issues.apache.org/jira/secure/attachment/12735592/YARN-3678.patch
          Optional Tests |
          git revision   | trunk / bb18163
          Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8099/console

          This message was automatically generated.

          zhiguohong Hong Zhiguo added a comment -

          The event sequence:
          call "SEND SIGTERM" -> pid recycle -> call "SEND SIGKILL" -> check process live time (based on current time)

          The time between [call "SEND SIGTERM"] and [call "SEND SIGKILL"] is 250ms.
          The time between [pid recycle] and [check process live time] may be shorter or longer than 250ms. When it's longer than 250ms, there's a chance we make a false-positive judgement.

          varun_saxena Varun Saxena added a comment -

          Yeah, that's why I said that if we can increase the value of pid_max on a 64-bit machine to the highest value it can take, i.e. 2^22, that should mitigate the risk of this happening. But anyway, as I mentioned above, we can still fix this regardless.
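
          As a point of reference, here is a minimal sketch of a deployment-side check (my own illustration, not part of any patch here); kernel.pid_max is an OS sysctl, and 2^22 = 4194304 is its upper bound on 64-bit Linux:

          {code:java}
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          // Illustrative deployment check, not YARN code: warn when kernel.pid_max is far below
          // 2^22 (4194304), the maximum Linux allows on 64-bit kernels.
          public class PidMaxCheck {
            public static void main(String[] args) throws Exception {
              long pidMax = Long.parseLong(new String(
                  Files.readAllBytes(Paths.get("/proc/sys/kernel/pid_max")),
                  StandardCharsets.UTF_8).trim());
              if (pidMax < (1L << 22)) {
                System.err.println("kernel.pid_max is " + pidMax
                    + "; raising it toward 4194304 shrinks the pid-reuse window");
              }
            }
          }
          {code}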

          vvasudev Varun Vasudev added a comment -

          Hong Zhiguo - sorry, my question was: after applying your fix, the problem should have gone away. However, you said "With this fix, the 'accident' rate is reduced from several times per day to nearly zero." Do you know why it still happened?

          zhiguohong Hong Zhiguo added a comment -

          First, "stop container" happens frequently.
          Second, the pid recycle doesn't need to complete a whole round within the 250ms; it only needs to complete one or more rounds during the container's lifetime.

          If we have 100 "stop container" events on one node per day, we have about 100/32768, roughly a 0.3% chance per node per day. That's not very low, especially when we have 5000 nodes.
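
          For scale (my arithmetic, extending the estimate above): at roughly 0.3% per node per day, a 5000-node cluster would expect about 0.003 × 5000 ≈ 15 such collisions per day, which is consistent with the "several times per day" rate reported elsewhere in this thread.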

          Naganarasimha Naganarasimha G R added a comment -

          Hi Varun Vasudev & Hong Zhiguo, for us it happened in a secure setup, and one key point is that the NM user and the user of the container are the same. But irrespective of this, it could have killed any other process [container] for the same/another app running on the same node, submitted by the same user. One suggestion (a crude fix, not sure how to get it working for other OSes): can we grep for the containerID to confirm it's the same process we are targeting, and then kill it?
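
          A rough sketch of that suggestion (Linux-only, and assuming the container ID appears on the target's command line; the class and method names are invented for illustration):

          {code:java}
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          // Rough sketch of the suggestion above: before sending the kill, confirm that the pid's
          // command line still mentions the expected container ID.  Linux-only; purely illustrative.
          public class ContainerIdCheckSketch {
            static boolean pidStillBelongsToContainer(String pid, String containerId) {
              try {
                // /proc/<pid>/cmdline is NUL-separated; a plain substring check is enough here.
                byte[] raw = Files.readAllBytes(Paths.get("/proc", pid, "cmdline"));
                String cmdline = new String(raw, StandardCharsets.UTF_8).replace('\0', ' ');
                return cmdline.contains(containerId);
              } catch (Exception e) {
                // Process already gone (or unreadable): treat it as "not ours" and skip the kill.
                return false;
              }
            }
          }
          {code}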

          varun_saxena Varun Saxena added a comment -

          As Hong Zhiguo mentioned, in our case too the same user is used for the NM and the app-submitter.

          varun_saxena Varun Saxena added a comment -

          Secure.

          varun_saxena Varun Saxena added a comment -

          I think if we increase the value of pid_max, the issue is unlikely to occur.

          vvasudev Varun Vasudev added a comment -

          Hong Zhiguo thanks for the detailed explanation! When you say your fix reduced the rate to nearly zero, do you know why the accidental kill continued to happen?

          zhiguohong Hong Zhiguo added a comment -

          We met the same issue on our production cluster last year. The same user is used for the NM and some app-submitters.
          I hooked the kernel __send_signal function via kprobe (https://github.com/honkiko/signal-monitor) and confirmed what happens:

          • container-executor sends SIGTERM to a container (say, pid = X)
          • The container exits quickly (within 250ms)
          • pid X is recycled and taken by a newly spawned thread of the NM
          • After 250ms, container-executor sends SIGKILL to pid X
          • The NM is killed

          I added a check of the living time before container-executor sends SIGKILL: if the process has been alive for less than 250ms, it's not the target process we sent SIGTERM to, and we just skip it.

          With this fix, the "accident" rate is reduced from several times per day to nearly zero.
          If you think such a fix is acceptable, I'll post it here.
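
          The change described here lives in container-executor (which is C); purely to illustrate the same idea, a sketch of the check in Java might look like the following (Linux /proc only, with the clock-tick rate assumed to be the usual 100 and all names invented):

          {code:java}
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          // Illustration of the living-time check described above; the real fix belongs in
          // container-executor (C).  Linux-only, and assumes the usual 100 clock ticks per second.
          public class ProcessAgeSketch {
            private static final long CLK_TCK = 100; // assumption: sysconf(_SC_CLK_TCK) on most Linux

            /** Milliseconds the process has been alive, or -1 if it no longer exists. */
            static long processAgeMillis(String pid) {
              try {
                String stat = new String(
                    Files.readAllBytes(Paths.get("/proc", pid, "stat")), StandardCharsets.UTF_8);
                // Field 22 of /proc/<pid>/stat is starttime, in clock ticks since boot.  The comm
                // field may contain spaces, so split only after the closing ')'.
                String[] fields = stat.substring(stat.lastIndexOf(')') + 2).split(" ");
                long startTicks = Long.parseLong(fields[19]); // field 22 overall = index 19 here
                double uptimeSec = Double.parseDouble(new String(
                    Files.readAllBytes(Paths.get("/proc/uptime")), StandardCharsets.UTF_8).split(" ")[0]);
                return (long) ((uptimeSec - (double) startTicks / CLK_TCK) * 1000);
              } catch (Exception e) {
                return -1;
              }
            }

            /** Only SIGKILL a pid whose owner already existed when SIGTERM was sent. */
            static boolean safeToSigkill(String pid, long millisSinceSigterm) {
              long age = processAgeMillis(pid);
              return age >= millisSinceSigterm; // a younger process cannot be the one we signalled
            }
          }
          {code}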

          vvasudev Varun Vasudev added a comment -

          Is this in secure or non-secure mode?

          varun_saxena Varun Saxena added a comment -

          Vinod Kumar Vavilapalli, I suspect it's a single user for everything. gu-chi can confirm, though.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Tx for the update Varun Saxena. You mentioned LCE. But like I said before, LCE kills containers as the app-submitter. So, in your case, what is the user running the containers?

          varun_saxena Varun Saxena added a comment -

          Vinod Kumar Vavilapalli, as this issue happened in one of our customer deployments, I will explain it. We hit an issue where the NM was being randomly killed at one of the sites where our Hadoop distribution is deployed. In the logs, we could see the NM being killed immediately after signalContainer is called. What happens is as follows:

          1. LCE sends a SIGTERM to the container and waits for 250 ms.
          2. Probably within this 250 ms period, the container processes the signal and exits gracefully.
          3. Now it is possible that the pid assigned to the container is taken up by some other process or thread (which run as lightweight processes in Linux).
          4. When LCE then tries to send a SIGKILL to the same pid, it might actually be sending it to another process or thread.
          5. As we could not find any other reason for the NM randomly going down, we suspect some new NM thread took up this pid and the SIGKILL was sent to it, crashing the NM. This is based on suspicion rather than fool-proof analysis, though; not sure how to verify whether this is indeed what happened.

          Please note that pid_max in the deployment was 32768.
          I am not sure which user was the process owner, though. Probably gu-chi can shed some light on that.
          An additional check can be done, IMHO.

          gu chi gu-chi added a comment -

          The PID number may not be in use by a process; it can also be a thread. Linux treats processes and threads the same way for signals, so killing one thread of a process may kill the whole process too. And for a thread, starting up within 250ms is possible, right?

          gu chi gu-chi added a comment -

          I see the probability is low, but under heavy task load it occurs frequently. I would suggest adding a check before the kill: check whether the process ID still belongs to the container.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          The default delay is 250 milliseconds. So it is very hard to hit this condition.

          At least when LinuxContainerExecutor is used, the kill is done as the user itself, so it's unlikely it will affect other users' processes.

          Other than also doing a user-check to ensure it's the same user's container, I am not sure what else can be done.

          gu chi gu-chi added a comment -

          I think decreasing the pid_max setting in the OS can increase the probability of reproducing this; working on it.


            People

            • Assignee: Unassigned
            • Reporter: gu chi gu-chi
            • Votes: 0
            • Watchers: 19
