Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-10320

Rack failures may result in NN terminate

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.6.0
    • 2.8.0, 3.0.0-alpha1
    • None

    Description

      If there're rack failures which end up leaving only 1 rack available, BlockPlacementPolicyDefault#chooseRandom may get InvalidTopologyException when calling NetworkTopology#chooseRandom, which then throws all the way out to BlockManager's ReplicationMonitor thread and terminate the NN.

      Log:

      2016-02-24 09:22:01,514  WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
      
      2016-02-24 09:22:01,958  ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception. 
      org.apache.hadoop.net.NetworkTopology$InvalidTopologyException: Failed to find datanode (scope="" excludedScope="/rack_a5").
      	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:729)
      	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:694)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:635)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:348)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
      	at java.lang.Thread.run(Thread.java:745)
      

      Attachments

        1. HDFS-10320.01.patch
          2 kB
          Xiao Chen
        2. HDFS-10320.02.patch
          4 kB
          Xiao Chen
        3. HDFS-10320.03.patch
          21 kB
          Xiao Chen
        4. HDFS-10320.04.patch
          22 kB
          Xiao Chen
        5. HDFS-10320.05.patch
          22 kB
          Xiao Chen
        6. HDFS-10320.06.patch
          22 kB
          Xiao Chen

        Issue Links

          Activity

            People

              xiaochen Xiao Chen
              xiaochen Xiao Chen
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: