Hadoop HDFS
HDFS-1125

Removing a datanode (failed or decommissioned) should not require a namenode restart

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      I've heard of several Hadoop users using dfsadmin -report to monitor the number of dead nodes, and alert if that number is not 0. This mechanism tends to work pretty well, except when a node is decommissioned or fails, because then the namenode requires a restart for said node to be entirely removed from HDFS. More details here:

      http://markmail.org/search/?q=decommissioned%20node%20showing%20up%20ad%20dead%20node%20in%20web%20based%09interface%20to%20namenode#query:decommissioned%20node%20showing%20up%20ad%20dead%20node%20in%20web%20based%09interface%20to%20namenode+page:1+mid:7gwqwdkobgfuszb4+state:results

      Removal from the exclude file and a refresh should get rid of the dead node.
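A minimal sketch of the monitoring pattern described above (alert when the dead-node count is nonzero). The `dfsadmin -report` summary-line format ("N total, M dead") is assumed from branch-0.20 output and may differ across Hadoop versions:

```shell
# Hedged sketch: alert when `dfsadmin -report` shows any dead datanodes.
# The "(N total, M dead)" summary-line format is an assumption.
report=$(hadoop dfsadmin -report 2>/dev/null)
dead=$(printf '%s\n' "$report" \
        | sed -n 's/.*(\([0-9][0-9]*\) total, \([0-9][0-9]*\) dead).*/\2/p')
if [ "${dead:-0}" -ne 0 ]; then
    echo "ALERT: $dead dead datanode(s) reported by the namenode" >&2
    exit 1
fi
```

This is exactly the style of check that breaks when a decommissioned node lingers in the dead list, since the alert fires even though nothing is wrong.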


          Activity

          Allen Wittenauer added a comment -

          I've seen this as well.

          The basic premise is that you are removing a node from the grid permanently. So you:

          a) add node to dfs.hosts.exclude
          b) dfsadmin -refreshNodes
          c) wait for decom to finish
          d) remove node from both dfs.hosts and dfs.hosts.exclude

          If you check the web UI and dfsadmin -report, it is still listed as valid.
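The four steps above can be sketched as a shell sequence; the file paths and hostname are illustrative assumptions, and the `dfsadmin` invocations assume the 0.20-era CLI:

```shell
# Illustrative paths and hostname; adjust for your cluster.
INCLUDE=/etc/hadoop/conf/dfs.hosts
EXCLUDE=/etc/hadoop/conf/dfs.hosts.exclude
NODE=dn42.example.com

# a) add node to dfs.hosts.exclude
echo "$NODE" >> "$EXCLUDE"

# b) ask the namenode to re-read the include/exclude lists
hadoop dfsadmin -refreshNodes

# c) wait for decommissioning to finish

# d) remove node from both files, then refresh again
grep -v "^$NODE\$" "$INCLUDE" > "$INCLUDE.tmp" && mv "$INCLUDE.tmp" "$INCLUDE"
grep -v "^$NODE\$" "$EXCLUDE" > "$EXCLUDE.tmp" && mv "$EXCLUDE.tmp" "$EXCLUDE"
hadoop dfsadmin -refreshNodes
```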

          Arun Ramakrishnan added a comment -

          Related to step c: how would one know when decommissioning is finished?
          Also, I suppose you can remove the node from the excludes file at the same time you remove it from the slaves file?

          Allen Wittenauer added a comment -

          It will show up in the dead node list.

          Allen Wittenauer added a comment -

          This really needs to get fixed for 0.22. This is a huge issue, as it makes it difficult to build monitoring tools for alerting purposes.

          Nigel Daley added a comment -

          At this point I don't see how this six-month-old unassigned issue is a blocker for 0.22. I also think this is an improvement, not a bug. Removing from the 0.22 blocker list.

          Allen Wittenauer added a comment -

          That's because you aren't in ops.

          Matthias Friedrich added a comment -

          We also got complaints from our admins about this because it makes it really hard to set up professional monitoring. My company operates close to 100,000 machines (only a handful of Hadoop nodes, though), so it's a big concern that our infrastructure behaves well.

          Also, node decommissioning is one of the things QA departments typically test during product
          evaluation, so this could hamper Hadoop adoption in some organizations.

          Rita M added a comment -

          IMO this should be a blocker. My team has been burnt by this many times.

          Allen Wittenauer added a comment -

          I'm setting this back to a blocker.

          Koji Noguchi added a comment -

          Does HDFS-1773 help?

          philo vivero added a comment -

          Please keep this as a blocker. I cannot believe the amount of work I have to go through to decommission a node. This should be nearly automatic.

          Matt Foley added a comment -

          HDFS-1773 seems to be a duplicate of this, and it is resolved/fixed in trunk (v23) and 0.20.204.0. (Thanks, Koji.) The only requirement seems to be that dfs.hosts is used.

          Aaron T. Myers added a comment -

          Allen et al, do you agree with Matt that this issue was addressed by HDFS-1773? Can we close out this issue?

          Allen Wittenauer added a comment -

          I don't have a working grid with this patch to test it. So no, I can't agree at this point in time.

          Allen Wittenauer added a comment -

          The problem still seems to be present in 0.20.203, so I'm guessing no, the problem hasn't been fixed by HDFS-1773.

          How I tested:

          a) create a grid with 203, filling in dfs.hosts
          b) populate it with data
          c) put host in dfs.exclude
          d) -refreshNodes, verify host is in decom'ing nodes
          e) let decom process finish
          f) host now shows up in dead
          g) remove host from dfs.host and dfs.exclude
          h) -refreshNodes
          i) node is still listed as dead by nn
          j) kill DataNode process
          k) node is still listed as dead by nn
          l) 10 mins later, still listed...
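The "still listed as dead" checks in steps f), i), and k) can be scripted rather than read off the web UI; a hedged sketch (the hostname and the branch-0.20 `-report` layout are assumptions):

```shell
# Print the Decommission Status line for a given host from dfsadmin -report.
# Hostname and report layout are assumptions; adjust for your cluster.
NODE=dn42.example.com
hadoop dfsadmin -report | awk -v n="$NODE" '
    /^Name: / && $0 ~ n { found = 1 }
    found && /Decommission Status/ { print $NF; exit }'
```

If the node has been fully forgotten by the namenode, the script prints nothing; the bug reported here is that it keeps printing a status long after the host has been removed from both lists.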

          Rita M added a comment -

          Just curious, has this issue been resolved and someone forgot to close the JIRA Item?

          Harsh J added a comment -

          Resolved via HDFS-1773. It was in the version after the one Allen tried above, I think; that's why he may not have seen it? Please reopen if not.

          Allen Wittenauer added a comment -

          It was still broken in 0.20.204, which was the last time I tried.


            People

            • Assignee: Unassigned
            • Reporter: Alex Loddengaard
            • Votes: 6
            • Watchers: 21

              Dates

              • Created:
                Updated:
                Resolved:
