Hadoop Common
  HADOOP-9085

NameNode fails to start because the pid in the namenode pid file belongs to another process or thread

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.1-alpha, 2.0.3-alpha
    • Fix Version/s: None
    • Component/s: bin, scripts
    • Labels:
      None
    • Environment:

      NA

      Description

      If the pid recorded in the namenode pid file is in use by another process, or as a thread ID, when the namenode is started, the start fails. Before starting the namenode, the hadoop-daemon.sh script checks the pid from the pid file with kill -0. When that pid now belongs to another process or thread, kill -0 still returns success, which the script takes to mean that the namenode is running. In reality, the namenode is not running.

      2338 is the pid of the dead namenode, taken from the pid file
      2305 is the datanode pid; 2338 is now the ID of one of its threads (tid)

      cqn2:/tmp # kill -0 2338
      cqn2:/tmp # ps -wweLo pid,ppid,tid | grep 2338
      2305 1 2338
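
      The false positive can be reproduced without Hadoop. The sketch below is only an illustration: the pid file path is hypothetical and a sleep process stands in for whatever has reused the stale ID. It shows that kill -0 only tests whether something with that ID exists and can be signalled, not whether it is the namenode.

```shell
#!/bin/sh
# Hypothetical pid file location, for illustration only.
pidfile="${TMPDIR:-/tmp}/demo-namenode.$$.pid"

# An unrelated process stands in for whatever reused the stale ID.
sleep 30 &
other=$!

# Simulate a stale pid file that now points at the unrelated process.
echo "$other" > "$pidfile"

# The same kind of test the start script performs: it succeeds
# even though no namenode is running.
if kill -0 "$(cat "$pidfile")" > /dev/null 2>&1; then
  result="reported as running"
else
  result="reported as stopped"
fi
echo "namenode $result"

# Clean up the stand-in process and the demo pid file.
kill "$other"
wait "$other" 2>/dev/null || true
rm -f "$pidfile"
```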

        Issue Links

          Activity

          Allen Wittenauer made changes -
          Assignee Allen Wittenauer [ aw ]
          Allen Wittenauer made changes -
          Fix Version/s 2.0.1-alpha [ 12322467 ]
          Fix Version/s 2.0.2-alpha [ 12322473 ]
          Fix Version/s 2.7.0 [ 12327583 ]
          Allen Wittenauer made changes -
          Component/s scripts [ 12311393 ]
          Arun C Murthy made changes -
          Fix Version/s 2.7.0 [ 12327583 ]
          Fix Version/s 2.6.0 [ 12327179 ]
          Karthik Kambatla (Inactive) made changes -
          Fix Version/s 2.6.0 [ 12327179 ]
          Fix Version/s 2.5.0 [ 12326263 ]
          Arun C Murthy made changes -
          Fix Version/s 2.5.0 [ 12326263 ]
          Fix Version/s 2.4.0 [ 12326144 ]
          Arun C Murthy made changes -
          Fix Version/s 2.4.0 [ 12326144 ]
          Fix Version/s 2.3.0 [ 12325254 ]
          Arun C Murthy made changes -
          Fix Version/s 2.3.0 [ 12325254 ]
          Fix Version/s 2.4.0 [ 12324587 ]
          Arun C Murthy made changes -
          Fix Version/s 2.3.0 [ 12324587 ]
          Fix Version/s 2.1.0-beta [ 12324030 ]
          Arun C Murthy made changes -
          Fix Version/s 2.0.4-beta [ 12324030 ]
          Fix Version/s 2.0.3-alpha [ 12323273 ]
          Steve Loughran added a comment -

          Link to HADOOP-9086, which proposes a more rigorous process check mechanism

          Steve Loughran made changes -
          Field Original Value New Value
          Link This issue relates to HADOOP-9086 [ HADOOP-9086 ]
          Steve Loughran added a comment -

          Pid recycling is a permanent problem with Unix systems; you are correct that something needs to be done. We can't rely on deleting the pid file on a successful shutdown either, as all forms of killing are "successful", even a server reboot.

          I don't think the proposed patch would work, as it's still looking for a file $pid even though that is no longer needed, and that file is also used in the error text. Better to skip the -f check and use $curpid in the error. Even after that, it's pretty brittle against unintentional command matches.

          What we need to do is move away from pid-file-liveness tests altogether.

          There is a far more robust alternative: the service being started should create an exclusive write lock on a well-known file. When the process dies, the OS automatically releases this lock. I'll open a JIRA on it.
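
          As a rough illustration of the lock-based idea, the sketch below uses the flock(1) utility from util-linux. That choice is an assumption: the comment does not name a mechanism, and a Java service would more likely take the lock itself. The lock file path and the is_free helper are hypothetical. The point is that the kernel drops the lock when the holding process exits, however it dies, so no stale state is left behind.

```shell
#!/bin/sh
# Hypothetical lock file; a real service would use a well-known path.
lockfile="${TMPDIR:-/tmp}/demo-namenode.$$.lock"

# Succeeds (exit 0) only if no live process currently holds the lock.
is_free() {
  flock -n "$lockfile" true
}

# Stand-in "service": take the exclusive lock on fd 9, then exec so that
# the lock is held by the long-running process itself.
(
  exec 9>"$lockfile"
  flock -n 9 || exit 1
  exec sleep 30
) &
holder=$!
sleep 1   # give the holder time to acquire the lock

if is_free; then before="free"; else before="held"; fi

# Kill the holder; the OS releases the lock with no cleanup step.
kill "$holder"
wait "$holder" 2>/dev/null || true

if is_free; then after="free"; else after="held"; fi
echo "while running: $before, after kill: $after"
rm -f "$lockfile"
```

          A start script built on this would refuse to start only while the lock is actually held, so a recycled pid could no longer produce a false "running" result.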

          liaowenrui added a comment -

          1. When the namenode is stopped successfully, the namenode pid file should be deleted.
          2. When starting the namenode, the check should use ps rather than kill -0.

          Current code in hadoop-daemon.sh:

          if [ -f $pid ]; then
            if kill -0 `cat $pid` > /dev/null 2>&1; then
              echo $command running as process `cat $pid`. Stop it first.
              exit 1
            fi
          fi

          We would change it like this:

          if [ -f $pid ]; then
            tmppid=`cat $pid`
            curpid=`ps -ww -eo pid,user,euid,cmd | grep "org.apache.hadoop.hdfs." | grep "$command" | grep $tmppid | grep -v "grep" | awk '{print $1}'`
            if [ -n "$curpid" ]; then
              echo $command running as process `cat $pid`. Stop it first.
              exit 1
            fi
          fi
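
          One way to narrow the ps-based check is to ask ps about the single recorded pid instead of grepping the whole process table. This is only a sketch and not part of the proposed patch: the pid_matches helper is hypothetical, it assumes a ps that supports -p and -o args= (as procps does), and sleep stands in for a real daemon. Selecting with -p looks up process IDs only, so a thread ID like the one in the description would not match, and a partial number match inside another pid is avoided. It is still a heuristic, because a recycled pid could belong to another process with a similar command line.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if $1 is a live process whose
# command line contains the marker string $2.
pid_matches() {
  args=$(ps -p "$1" -o args= 2>/dev/null) || return 1
  case "$args" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Demonstration with a stand-in process instead of a real namenode.
sleep 30 &
other=$!
sleep 1   # let the background job exec before inspecting it

if pid_matches "$other" "sleep 30"; then same="yes"; else same="no"; fi
if pid_matches "$other" "org.apache.hadoop.hdfs"; then nn="yes"; else nn="no"; fi
echo "matches sleep: $same, matches namenode: $nn"

# Clean up the stand-in process.
kill "$other"
wait "$other" 2>/dev/null || true
```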

          liaowenrui created issue -

            People

            • Assignee:
              Allen Wittenauer
              Reporter:
              liaowenrui
            • Votes:
              0
              Watchers:
              6
