  1. Hadoop HDFS
  2. HDFS-6626

Node is marked decommissioned if it becomes dead when it is being decommissioned

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem

    Description

      I am not sure whether this is by design, but it isn't intuitive. The scenario: you start decommissioning a node, and while the decommission is in progress the node becomes dead from the NN's point of view. Right after that, the NN marks the node as decommissioned, so on the web UI administrators will assume the decommission completed successfully. This happens because decommission is considered done as soon as there are no blocks left for the DN.

      BlockManager.java
        boolean isReplicationInProgress(DatanodeDescriptor srcNode) {
          boolean status = false;
          ...
          final Iterator<? extends Block> it = srcNode.getBlockIterator();
          while (it.hasNext()) {
            ...
            // set status if there is block under replication
          }
          ...
          return status;
        }
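
To make the effect easier to see, here is a minimal, self-contained sketch of the check above. It is not the real HDFS code: `Node`, `BlockInfo`, `markDead` and `looksDecommissioned` are hypothetical stand-ins, and the sketch assumes that the NN drops a node's block list when it declares the node dead. Under that assumption the loop has nothing to iterate over, `status` stays false, and the node looks fully decommissioned.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified, hypothetical model of the check; these are NOT the real HDFS classes.
class DecommissionCheckSketch {

    /** Stand-in for a block stored on a datanode. */
    static final class BlockInfo {
        final boolean underReplicated;

        BlockInfo(boolean underReplicated) {
            this.underReplicated = underReplicated;
        }
    }

    /** Stand-in for DatanodeDescriptor: only tracks the node's blocks. */
    static final class Node {
        final List<BlockInfo> blocks = new ArrayList<>();

        Iterator<BlockInfo> getBlockIterator() {
            return blocks.iterator();
        }

        // Assumption: declaring a node dead removes all of its blocks.
        void markDead() {
            blocks.clear();
        }
    }

    /** Same shape as the excerpt: true while any block still needs replication. */
    static boolean isReplicationInProgress(Node srcNode) {
        boolean status = false;
        final Iterator<BlockInfo> it = srcNode.getBlockIterator();
        while (it.hasNext()) {
            if (it.next().underReplicated) {
                status = true;
            }
        }
        return status;
    }

    /** Decommission is treated as done when nothing is left to replicate. */
    static boolean looksDecommissioned(Node node) {
        return !isReplicationInProgress(node);
    }

    public static void main(String[] args) {
        Node node = new Node();
        node.blocks.add(new BlockInfo(true));
        System.out.println("live, pending block: " + looksDecommissioned(node));
        node.markDead();
        System.out.println("dead, no blocks:     " + looksDecommissioned(node));
    }
}
```

The check itself cannot tell "all blocks were replicated" apart from "the node vanished together with its block list", which is the ambiguity this issue is about.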
      

      The question is whether we should mark the dead node as decommissioned (the current behavior) or as "decommission aborted". When administrators decommission nodes, they want to know both the decommission status and the health of the decommission-in-progress nodes. If they can detect a decommission failure sooner, they can act sooner; for example, if the top-of-rack (TOR) switch has issues during decommission, administrators could quickly spot a group of "decommission aborted" nodes from the same rack. The same information is available today by joining the decommissioning node list with the recent dead node list on the web UI; it is just not as convenient.
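
As a rough illustration of that manual join, the sketch below intersects a list of nodes under decommission with a set of dead nodes and groups the result by rack. Everything in it is hypothetical: the method name `abortedByRack`, the node and rack names, and the idea that both lists are already available as plain strings. It does not call any HDFS API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper; it does not use any HDFS classes.
class DecommissionJoinSketch {

    /**
     * Returns the nodes that appear in both the decommission list and the
     * dead list, grouped by rack. Nodes with no known rack are placed
     * under "/unknown-rack" (a made-up placeholder).
     */
    static Map<String, List<String>> abortedByRack(List<String> decommNodes,
                                                   Set<String> deadNodes,
                                                   Map<String, String> rackOf) {
        Map<String, List<String>> result = new TreeMap<>();
        for (String node : decommNodes) {
            if (deadNodes.contains(node)) {
                String rack = rackOf.getOrDefault(node, "/unknown-rack");
                result.computeIfAbsent(rack, r -> new ArrayList<>()).add(node);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> decomm = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> dead = new HashSet<>(Arrays.asList("dn1", "dn2", "dn9"));

        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("dn1", "/rack1");
        rackOf.put("dn2", "/rack1");
        rackOf.put("dn3", "/rack2");

        // dn1 and dn2 are both dead and share /rack1, which hints at a rack-level problem.
        System.out.println(abortedByRack(decomm, dead, rackOf));
    }
}
```

A dedicated "decommission aborted" state would make this grouping visible directly, without the administrator having to cross-reference two lists.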

      Suggestions?

          People

            Assignee: Unassigned
            Reporter: Ming Ma (mingma)
            Votes: 0
            Watchers: 4
