Hadoop HDFS / HDFS-1480

All replicas of a block can end up on the same rack when some datanodes are decommissioning.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      It appears that all replicas of a block can end up in the same rack. The likelihood of such mis-replicated blocks seems to be directly related to decommissioning of nodes.

      After a rolling OS upgrade of a running cluster (decommission 3-10% of the nodes, re-install, add them back), all replicas of about 0.16% of blocks ended up in the same rack.

      The Hadoop NameNode UI etc. does not seem to know about such incorrectly replicated blocks; "hadoop fsck .." does report that the blocks must be replicated on additional racks.

      Looking at ReplicationTargetChooser.java, the following seem suspect:

      snippet-01:

          int maxNodesPerRack =
            (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2;
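
      To see why snippet-01 looks suspect, it helps to plug in concrete numbers: with integer division, (totalNumOfReplicas-1)/numOfRacks + 2 evaluates to 2 whenever the replication factor is small relative to the rack count, so for a block with repl=2 the per-rack cap by itself does not rule out placing both replicas on one rack. The standalone sketch below (plain Java, no Hadoop dependencies; the rack counts are made-up illustrative values) simply evaluates the quoted expression.

          // Standalone evaluation of the maxNodesPerRack expression from snippet-01.
          // The rack counts below are arbitrary, for illustration only.
          public class MaxNodesPerRackDemo {
            static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
              // Same integer arithmetic as the quoted line in ReplicationTargetChooser
              return (totalNumOfReplicas - 1) / numOfRacks + 2;
            }

            public static void main(String[] args) {
              // repl=2 on a 40-rack cluster: (2-1)/40 + 2 = 2,
              // i.e. both replicas may legally share a single rack.
              System.out.println(maxNodesPerRack(2, 40)); // prints 2
              // repl=3 on a 40-rack cluster: (3-1)/40 + 2 = 2 as well.
              System.out.println(maxNodesPerRack(3, 40)); // prints 2
            }
          }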
      

      snippet-02:

            case 2:
              if (clusterMap.isOnSameRack(results.get(0), results.get(1))) {
                chooseRemoteRack(1, results.get(0), excludedNodes,
                                 blocksize, maxNodesPerRack, results);
              } else if (newBlock){
                chooseLocalRack(results.get(1), excludedNodes, blocksize,
                                maxNodesPerRack, results);
              } else {
                chooseLocalRack(writer, excludedNodes, blocksize,
                                maxNodesPerRack, results);
              }
              if (--numOfReplicas == 0) {
                break;
              }
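
      Read on its own, the case-2 branch above picks the rack for the next target as follows; the tiny model below restates the quoted conditionals with racks as plain strings (the rack names and the helper method are purely illustrative, not Hadoop code). It makes it easier to see that on the non-newBlock path, i.e. during re-replication, the choice targets the writer's local rack.

          // Illustrative restatement of the case-2 branch from snippet-02.
          // Racks are plain strings; nothing here calls into Hadoop.
          public class CaseTwoDemo {
            /** Rack targeted for the next replica, mirroring the quoted branch. */
            static String nextTargetRack(String rack0, String rack1,
                                         String writerRack, boolean newBlock) {
              if (rack0.equals(rack1)) {
                return "any rack other than " + rack0;  // chooseRemoteRack(...)
              } else if (newBlock) {
                return rack1;                           // chooseLocalRack(results.get(1), ...)
              } else {
                return writerRack;                      // chooseLocalRack(writer, ...)
              }
            }

            public static void main(String[] args) {
              // Re-replication (not a new block): the next target goes to the writer's rack.
              System.out.println(nextTargetRack("/rackA", "/rackB", "/rackA", false)); // /rackA
            }
          }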
      

      snippet-03:

          do {
            DatanodeDescriptor[] selectedNodes =
              chooseRandom(1, nodes, excludedNodes);
            if (selectedNodes.length == 0) {
              throw new NotEnoughReplicasException(
                                                   "Not able to place enough replicas");
            }
            result = (DatanodeDescriptor)(selectedNodes[0]);
          } while(!isGoodTarget(result, blocksize, maxNodesPerRack, results));
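
      The retry loop in snippet-03 keeps drawing random candidates until isGoodTarget accepts one. The sketch below is a deliberately reduced stand-in (the rack-list model and the stripped-down isGoodTarget are assumptions for illustration; the real method performs several additional checks): it keeps only the per-rack cap, and shows that with maxNodesPerRack = 2 a candidate on an already-used rack is still accepted, so both replicas of a repl=2 block can legally land on one rack.

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;

          // Reduced stand-in for the snippet-03 retry loop: only the per-rack
          // cap is modeled; node choice and all other checks are omitted.
          public class RetryLoopDemo {
            // Accept a candidate unless its rack already holds maxNodesPerRack
            // chosen targets (illustrative simplification of isGoodTarget).
            static boolean isGoodTarget(String candidateRack, int maxNodesPerRack,
                                        List<String> chosenRacks) {
              return Collections.frequency(chosenRacks, candidateRack) < maxNodesPerRack;
            }

            public static void main(String[] args) {
              int maxNodesPerRack = 2;                   // value snippet-01 yields for repl=2
              List<String> chosenRacks = new ArrayList<>();
              chosenRacks.add("/rackA");                 // first replica already on /rackA

              // A randomly drawn candidate on the same rack still passes the check:
              System.out.println(isGoodTarget("/rackA", maxNodesPerRack, chosenRacks)); // true
            }
          }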
      
      Attachments

      1. hdfs-1480-test.txt (3 kB, Todd Lipcon)
      2. hdfs-1480.txt (22 kB, Todd Lipcon)
      3. hdfs-1480.txt (23 kB, Todd Lipcon)
      4. hdfs-1480.txt (22 kB, Todd Lipcon)

        Issue Links

          • This issue relates to HDFS-15

          Activity

          Todd Lipcon made changes -
            Status: Patch Available [ 10002 ] → Resolved [ 5 ]
            Hadoop Flags: [Reviewed]
            Resolution: Fixed [ 1 ]
          Todd Lipcon made changes -
            Summary: All replicas for a block with repl=2 end up in same rack → All replicas of a block can end up on the same rack when some datanodes are decommissioning.
          Todd Lipcon made changes -
            Attachment: hdfs-1480.txt [ 12491018 ]
          Todd Lipcon made changes -
            Status: Open [ 1 ] → Patch Available [ 10002 ]
            Fix Version/s: 0.23.0 [ 12315571 ]
          Todd Lipcon made changes -
            Attachment: hdfs-1480.txt [ 12490729 ]
          Todd Lipcon made changes -
            Attachment: hdfs-1480.txt [ 12490492 ]
          Todd Lipcon made changes -
            Assignee: Todd Lipcon [ tlipcon ]
          Todd Lipcon made changes -
            Attachment: hdfs-1480-test.txt [ 12483658 ]
          T Meyarivan made changes -
            Summary: All replicas for a block end up in same rack → All replicas for a block with repl=2 end up in same rack
            Description: edited
          Tsz Wo Nicholas Sze made changes -
            Link: This issue relates to HDFS-15 [ HDFS-15 ]
          Tsz Wo Nicholas Sze made changes -
            Affects Version/s: 0.20.2 [ 12314204 ]
            Priority: Minor [ 4 ] → Major [ 3 ]
            Description: edited
            Component/s: name-node [ 12312926 ]
          T Meyarivan created issue -

            People

            • Assignee: Todd Lipcon
            • Reporter: T Meyarivan
            • Votes: 0
            • Watchers: 8
