  1. Hadoop HDFS
  2. HDFS-9619

SimulatedFSDataset sometimes cannot find blockpool for the correct namenode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha1
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: datanode, test
    • Labels:
    • Environment:

      Jenkins

    • Hadoop Flags:
      Reviewed

      Description

      TestBalancerWithMultipleNameNodes.testBalancer sometimes fails to replicate a file because a data node is excluded.

      File /tmp.txt could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
       at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:415)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
      

      The relevant logs suggest the root cause is that the block pool is not found.

      2016-01-03 22:11:43,174 [DataXceiver for client DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR datanode.DataNode (DataXceiver.java:run(280)) - host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47318 dst: /127.0.0.1:49997
      java.io.IOException: Non existent blockpool BP-1927700312-172.26.2.1-1451887902222
      at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
      at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
      at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
      at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
      at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
      at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
      at java.lang.Thread.run(Thread.java:745)
      

      For a bit more context, this test starts a cluster with two name nodes and one data node. Both block pools are added, but one of them cannot be found afterwards. The root cause is unsynchronized concurrent access to a hash map in SimulatedFSDataset (two block pools are added simultaneously). I added some logging to print blockMap and saw a few ConcurrentModificationExceptions. The solution would be to use a thread-safe class instead, such as ConcurrentHashMap.
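      The proposed fix could be sketched roughly as follows. This is only an illustration under stated assumptions: the class BlockPoolRegistrySketch and its methods addBlockPool, getMap and addConcurrently are hypothetical names, and the code mirrors neither the actual SimulatedFSDataset nor the attached patches. It shows two block pools registered from separate threads into a ConcurrentHashMap, so that both remain visible afterwards.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the block pool bookkeeping in SimulatedFSDataset.
class BlockPoolRegistrySketch {
    // Assumption: block pool id -> (block id -> block description).
    // ConcurrentHashMap makes concurrent registration safe without extra locking.
    private final Map<String, Map<Long, String>> blockMap = new ConcurrentHashMap<>();

    // Registers a block pool; safe to call from several threads at once.
    void addBlockPool(String bpid) {
        blockMap.putIfAbsent(bpid, new ConcurrentHashMap<>());
    }

    // Returns the per-pool map, or fails the same way the log above does.
    Map<Long, String> getMap(String bpid) throws IOException {
        Map<Long, String> map = blockMap.get(bpid);
        if (map == null) {
            throw new IOException("Non existent blockpool " + bpid);
        }
        return map;
    }

    // Adds every given block pool from its own thread, released together
    // by a latch to make the registrations overlap as much as possible.
    static void addConcurrently(BlockPoolRegistrySketch registry, String... bpids)
            throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[bpids.length];
        for (int i = 0; i < bpids.length; i++) {
            final String bpid = bpids[i];
            threads[i] = new Thread(() -> {
                try {
                    start.await();
                    registry.addBlockPool(bpid);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        start.countDown();
        for (Thread t : threads) {
            t.join();
        }
    }

    public static void main(String[] args) throws Exception {
        BlockPoolRegistrySketch registry = new BlockPoolRegistrySketch();
        // Placeholder ids standing in for the two name nodes' block pools.
        addConcurrently(registry, "BP-A", "BP-B");
        registry.getMap("BP-A"); // would throw if the pool were lost
        registry.getMap("BP-B");
        System.out.println("both block pools found");
    }
}
```

      With a plain HashMap in place of the outer ConcurrentHashMap, the two unsynchronized puts could race and one entry could be lost, which matches the "Non existent blockpool" symptom described above.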

        Attachments

        1. HDFS-9619.001.patch
          1 kB
          Wei-Chiu Chuang
        2. HDFS-9619.002.patch
          4 kB
          Wei-Chiu Chuang

            People

            • Assignee:
              jojochuang Wei-Chiu Chuang
            • Reporter:
              jojochuang Wei-Chiu Chuang