Hadoop Common / HADOOP-15317

Improve NetworkTopology chooseRandom's loop

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.10.0, 2.8.4, 3.2.0, 3.1.1, 2.9.2, 3.0.3
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Recently we found a postmortem case where the ANN (active NameNode) seems to have been in an infinite loop. From the logs, it had just gone through a rolling restart, and DNs were getting registered.

      Later the NN became unresponsive, and from the stacktrace it was inside a do-while loop in NetworkTopology#chooseRandom, part of what was done in HDFS-10320.

      Going through the code and logs, I'm not able to come up with any theory for why this is happening (I thought about incorrect locking, or the Node object being modified outside of NetworkTopology; both seem impossible), but we should eliminate this loop.

      Stacktrace:

      java.util.HashMap.hash(HashMap.java:338)
      java.util.HashMap.containsKey(HashMap.java:595)
      java.util.HashSet.contains(HashSet.java:203)
      org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
      org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
      org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
      org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
      org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
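
As an illustration of what "eliminate this loop" could mean in general, here is a minimal sketch of choosing a random element outside an excluded set without a retry loop. It is not the code from the attached patches and does not use Hadoop's NetworkTopology API; the class BoundedRandomChoice and the method chooseRandomExcluding are hypothetical names made up for this example. The idea is to filter the candidates once and then draw a single random index, so the call always terminates even when the caller's idea of how many nodes are available is wrong.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical example class; not part of Hadoop.
public class BoundedRandomChoice {

  /**
   * Returns a uniformly random element of candidates that is not in excluded,
   * or null if every candidate is excluded. Runs in one pass over candidates,
   * so it cannot spin forever the way a "draw, check, retry" do-while loop can.
   */
  static <T> T chooseRandomExcluding(List<T> candidates,
                                     Collection<T> excluded,
                                     Random random) {
    // Collect the candidates that are still eligible.
    List<T> available = new ArrayList<>(candidates.size());
    for (T candidate : candidates) {
      if (!excluded.contains(candidate)) {
        available.add(candidate);
      }
    }
    // Nothing left to choose from: report that instead of retrying.
    if (available.isEmpty()) {
      return null;
    }
    // Exactly one random draw over the eligible candidates.
    return available.get(random.nextInt(available.size()));
  }

  public static void main(String[] args) {
    List<String> nodes = Arrays.asList("dn1", "dn2", "dn3");
    Set<String> excluded = new HashSet<>(Arrays.asList("dn1", "dn3"));

    // Only dn2 is eligible, so it is the only possible result.
    System.out.println(chooseRandomExcluding(nodes, excluded, new Random()));

    // With every node excluded the call returns null rather than looping.
    excluded.add("dn2");
    System.out.println(chooseRandomExcluding(nodes, excluded, new Random()));
  }
}
```

The trade-off is an O(n) pass per call instead of an expected O(1) draw, which may or may not be acceptable on a hot path such as block placement; a bounded retry count with a fallback to this filtered pass is another option under the same assumptions.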
      

        Attachments

        1. HADOOP-15317.01.patch
          2 kB
          Xiao Chen
        2. HADOOP-15317.02.patch
          9 kB
          Xiao Chen
        3. HADOOP-15317.03.patch
          10 kB
          Xiao Chen
        4. HADOOP-15317.04.patch
          10 kB
          Xiao Chen
        5. HADOOP-15317.05.patch
          10 kB
          Xiao Chen
        6. HADOOP-15317.06.patch
          10 kB
          Xiao Chen
        7. Screen Shot 2018-03-28 at 7.23.32 PM.png
          55 kB
          Ajay Kumar

              People

              • Assignee: Xiao Chen (xiaochen)
              • Reporter: Xiao Chen (xiaochen)
              • Votes: 0
              • Watchers: 9
