Ambari / AMBARI-19289

HDFS service check fails if the previously active NN is down


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.2
    • Fix Version/s: trunk
    • Component/s: ambari-server
    • Labels: None

    Description

      Steps to reproduce

      1. Enable NameNode HA.
      2. Shut down the active NameNode; the standby takes over.
      3. Run the HDFS service check.

      The HDFS service check script uses

      hdfs dfsadmin -fs hdfs://mycluster -safemode get | grep OFF

      to check whether the NameNode is out of safemode. However, this command fails if the first NN is down, without ever checking the state of the second NN. This is likely an HDFS bug similar to HDFS-8277.

      Proposal

      There are several approaches to fixing this:

      1. Loop over each NameNode address and get its safemode state with hdfs dfsadmin -fs hdfs://nn_host:8020 -safemode get | grep OFF. As long as one NN returns OFF, consider DFS out of safemode and continue with the rest of the check. However, is it really necessary to add such complexity to a service check?
      2. Remove the safemode check code. If HDFS is in safemode, the check's write operations will fail anyway (safemode is read-only), so the service check won't pass.
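      Approach #1 could look roughly like the following sketch. It assumes bash with the hdfs CLI on the PATH; the function name any_namenode_out_of_safemode and the nn1.example.com:8020 / nn2.example.com:8020 addresses are hypothetical placeholders, not part of Ambari (in a real cluster the addresses would come from the dfs.namenode.rpc-address.<nameservice>.<nn-id> properties).

```shell
#!/usr/bin/env bash
# Sketch of approach #1 (hypothetical helper, not the actual Ambari script):
# query each NameNode directly and treat HDFS as out of safemode as soon as
# any one of them reports OFF. A NameNode that is down simply fails the probe.

# Usage: any_namenode_out_of_safemode HOST:PORT [HOST:PORT ...]
# Returns 0 if some NameNode reports safemode OFF, 1 otherwise.
any_namenode_out_of_safemode() {
  local addr
  for addr in "$@"; do
    # Errors from a stopped NameNode are discarded; grep decides the result.
    if hdfs dfsadmin -fs "hdfs://${addr}" -safemode get 2>/dev/null | grep -q OFF; then
      echo "NameNode ${addr} reports safemode OFF"
      return 0
    fi
  done
  echo "no reachable NameNode reported safemode OFF" >&2
  return 1
}

# Example with hypothetical addresses; the first NameNode may be down:
# any_namenode_out_of_safemode nn1.example.com:8020 nn2.example.com:8020
```

      Even this small version has to know every NameNode address up front and decide how to treat unreachable ones, which is the extra complexity the proposal questions.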

      I prefer #2 because it makes the script simpler and works in all cases. Note that this is a service check: it should pass as long as HDFS is in a working state. It is not a NameNode check.
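      Approach #2 amounts to letting ordinary file operations be the check. The sketch below is a hypothetical bash helper (hdfs_smoke_test and the /tmp/service-check-<pid> scratch path are illustrative names, not Ambari's). It addresses the logical nameservice URI, so, assuming the client-side HA failover settings are configured, the HDFS client rather than the script finds the active NameNode.

```shell
#!/usr/bin/env bash
# Sketch of approach #2 (hypothetical helper, not the actual Ambari script):
# skip the safemode probe entirely and just write, read back and delete a
# small file. Writes are rejected while HDFS is in safemode, and an HA-aware
# client fails over to whichever NameNode is active.

# Usage: hdfs_smoke_test FS_URI   (e.g. hdfs://mycluster)
# Returns 0 if the write/read round trip succeeds, 1 otherwise.
hdfs_smoke_test() {
  local base="$1/tmp/service-check-$$"   # hypothetical scratch directory
  local payload="hdfs service check"
  local got rc=1
  if hdfs dfs -mkdir -p "$base" &&
     echo "$payload" | hdfs dfs -put -f - "$base/probe" &&   # "-" reads stdin
     got=$(hdfs dfs -cat "$base/probe") &&
     [ "$got" = "$payload" ]; then
    rc=0
  fi
  # Always try to clean up, even if an earlier step failed.
  hdfs dfs -rm -r -f -skipTrash "$base" >/dev/null 2>&1
  return "$rc"
}

# Example:
# hdfs_smoke_test hdfs://mycluster && echo "HDFS service check passed"
```

      Unlike the first sketch, this needs no list of NameNode addresses and no special handling for a stopped NameNode.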

      Attachments

        1. AMBARI-19289_trunk.01.patch
          2 kB
          Weiwei Yang
        2. AMBARI-19289_trunk.02.patch
          5 kB
          Weiwei Yang
        3. AMBARI-19289_branch-2.5.01.patch
          3 kB
          Weiwei Yang


            People

              Assignee: Weiwei Yang (cheersyang)
              Reporter: Weiwei Yang (cheersyang)
              Votes: 0
              Watchers: 2
