Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-12567

BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      Found this while doing some testing on an internal cluster with an unusual setup. We have a rack with ~20 nodes, then a few more with just a few nodes. It would fail to get (# data blocks) datanodes even though there were plenty of DNs on the rack with 20 DNs.

      I managed to reproduce this same issue in a unit test, stack trace like this:

      java.io.IOException: File /testfile0 could only be written to 5 of the 6 required nodes for RS-6-3-1024k. There are 9 datanode(s) running and no node(s) are excluded in this operation.
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2083)
      	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2609)
      	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:863)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:548)
      	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
      

      This isn't a very critical bug since it's an unusual rack configuration, but it can easily happen during testing.

      Attachments

        1. HDFS-12567.001.patch
          8 kB
          Andrew Wang
        2. HDFS-12567.002.patch
          8 kB
          Andrew Wang
        3. HDFS-12567.003.patch
          9 kB
          Andrew Wang
        4. HDFS-12567.repro.patch
          3 kB
          Andrew Wang

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            andrew.wang Andrew Wang
            andrew.wang Andrew Wang
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Issue deployment