Hadoop Common / HADOOP-2060

DFSClient should choose a block that is local to the node where the client is running



    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid


      While chasing down the DFSClient code to see how data locality affects DFS read
      throughput, I realized that DFSClient does not use data locality information
      (at least not obviously) when it chooses which replica of a block to read from.
      Here is the relevant code:

   /**
    * Pick the best node from which to stream the data.
    * Entries in <i>nodes</i> are already in the priority order.
    */
   private DatanodeInfo bestNode(DatanodeInfo nodes[],
                                 AbstractMap<DatanodeInfo, DatanodeInfo> deadNodes)
                                 throws IOException {
     if (nodes != null) {
       for (int i = 0; i < nodes.length; i++) {
         if (!deadNodes.containsKey(nodes[i])) {
           return nodes[i];  // first replica not known to be dead wins
         }
       }
     }
     throw new IOException("No live nodes contain current block");
   }

   private DNAddrPair chooseDataNode(LocatedBlock block)
     throws IOException {
     int failures = 0;
     while (true) {
       DatanodeInfo[] nodes = block.getLocations();
       try {
         DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
         InetSocketAddress targetAddr = DataNode.createSocketAddr(chosenNode.getName());
         return new DNAddrPair(chosenNode, targetAddr);
       } catch (IOException ie) {
         String blockInfo = block.getBlock() + " file=" + src;
         if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
           throw new IOException("Could not obtain block: " + blockInfo);
         }
         if (nodes == null || nodes.length == 0) {
           LOG.info("No node available for block: " + blockInfo);
         }
         LOG.info("Could not obtain block " + block.getBlock() + " from any node:  " + ie);
         try {
           Thread.sleep(3000);  // back off before retrying
         } catch (InterruptedException iex) {
         }
         deadNodes.clear(); //2nd option is to remove only nodes[blockId]
         failures++;
       }
     }
   }

      bestNode() simply returns the first replica that is not in deadNodes.
      This means that even when one replica is local to the node where the client
      is running, the client may well pick a remote replica instead.
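
      For illustration, here is a minimal sketch of a locality-preferring variant of
      bestNode(). The isLocalHost() helper is hypothetical (it is not part of DFSClient),
      and comparing resolved addresses is a simplification; a real fix would presumably
      reuse the namenode's network topology. It assumes java.net.InetAddress and
      java.net.UnknownHostException are imported, and that DatanodeInfo exposes its
      hostname via getHost().

   // Sketch: prefer a live replica on the local host, then fall back to
   // the current first-live-node behavior.
   private DatanodeInfo bestNode(DatanodeInfo nodes[],
                                 AbstractMap<DatanodeInfo, DatanodeInfo> deadNodes)
                                 throws IOException {
     if (nodes != null) {
       // First pass: a live replica on this very host, if any.
       for (int i = 0; i < nodes.length; i++) {
         if (!deadNodes.containsKey(nodes[i]) && isLocalHost(nodes[i].getHost())) {
           return nodes[i];
         }
       }
       // Second pass: the first live replica, exactly as today.
       for (int i = 0; i < nodes.length; i++) {
         if (!deadNodes.containsKey(nodes[i])) {
           return nodes[i];
         }
       }
     }
     throw new IOException("No live nodes contain current block");
   }

   // Hypothetical helper: resolve the datanode's hostname and compare it
   // with this machine's address; unresolvable hosts are treated as remote.
   private static boolean isLocalHost(String host) {
     try {
       InetAddress addr = InetAddress.getByName(host);
       return addr.isLoopbackAddress() || addr.equals(InetAddress.getLocalHost());
     } catch (UnknownHostException e) {
       return false;
     }
   }

      Keeping the fallback pass means the change is purely a preference: when no
      local replica is live, the client behaves exactly as it does now.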

      Map/reduce tries to schedule each mapper on a node that holds a copy of the
      mapper's input split. However, if the DFSClient does not use that information
      when choosing a replica to read, the mapper may well end up reading its data
      over the network even though a local replica is available.
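
      For context, here is a small, self-contained sketch of the locality information
      the framework already has for each split. The path and hostnames are made up,
      and I am assuming a FileSplit constructor that accepts an explicit host list:

   import java.io.IOException;

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapred.FileSplit;

   // The scheduler places mappers using exactly this per-split host list;
   // the DFSClient could compare the same hostnames against its own host
   // when it picks a replica to read.
   public class SplitLocations {
     public static void main(String[] args) throws IOException {
       String[] replicaHosts = { "node1.example.com", "node2.example.com" };
       FileSplit split = new FileSplit(new Path("/user/demo/part-00000"),
                                       0L, 64L * 1024 * 1024, replicaHosts);
       for (String host : split.getLocations()) {
         System.out.println("replica on: " + host);
       }
     }
   }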

      I hope I have missed something and misunderstood the code.
      Otherwise, this will be a serious performance problem.




            Assignee: Unassigned
            Reporter: Runping Qi