[HDFS-4898] BlockPlacementPolicyWithNodeGroup.chooseRemoteRack() fails to properly fallback to local rack - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.2.0, 2.0.4-alpha
Fix Version/s: 1.3.0, 2.1.1-beta
Component/s: namenode
Labels:
None

Hadoop Flags:

Reviewed

Description

As currently implemented, BlockPlacementPolicyWithNodeGroup does not properly fallback to local rack when no nodes are available in remote racks, resulting in an improper NotEnoughReplicasException.

BlockPlacementPolicyWithNodeGroup.java

  @Override
  protected void chooseRemoteRack(int numOfReplicas,
      DatanodeDescriptor localMachine, HashMap<Node, Node> excludedNodes,
      long blocksize, int maxReplicasPerRack, List<DatanodeDescriptor> results,
      boolean avoidStaleNodes) throws NotEnoughReplicasException {
    int oldNumOfReplicas = results.size();
    // randomly choose one node from remote racks
    try {
      chooseRandom(
          numOfReplicas,
          "~" + NetworkTopology.getFirstHalf(localMachine.getNetworkLocation()),
          excludedNodes, blocksize, maxReplicasPerRack, results,
          avoidStaleNodes);
    } catch (NotEnoughReplicasException e) {
      chooseRandom(numOfReplicas - (results.size() - oldNumOfReplicas),
          localMachine.getNetworkLocation(), excludedNodes, blocksize,
          maxReplicasPerRack, results, avoidStaleNodes);
    }
  }

As currently coded the chooseRandom() call in the catch block will never succeed as the set of nodes within the passed in node path (e.g. /rack1/nodegroup1) is entirely contained within the set of excluded nodes (both are the set of nodes within the same nodegroup as the node chosen first replica).

The bug is that the fallback chooseRandom() call in the catch block should be passing in the complement of the node path used in the initial chooseRandom() call in the try block (e.g. /rack1) - namely:

NetworkTopology.getFirstHalf(localMachine.getNetworkLocation())

This will yield the proper fallback behavior of choosing a random node from within the same rack, but still excluding those nodes in the same nodegroup

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

h4898_20130809.patch
09/Aug/13 15:00
2 kB
Tsz-wo Sze
h4898_20130809_b-1.patch
15/Aug/13 05:05
1 kB
Tsz-wo Sze

Issue Links

relates to

HADOOP-8468 Umbrella of enhancements to support different failure and locality topologies

Resolved

Activity

People

Assignee:: Tsz-wo Sze

Reporter:: Eric Sirianni

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 11/Jun/13 14:08

Updated:: 24/Sep/13 23:33

Resolved:: 15/Aug/13 04:58