Hadoop HDFS / HDFS-14429

Block remains in COMMITTED but not COMPLETE state, caused by Decommission


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.2
    • Fix Version/s: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
    • Component/s: None
    • Labels: None

    Description

      In the following scenario, a block remains in the COMMITTED state, never reaches COMPLETE, and the file cannot be closed properly:

      1. The client writes block bk1 to three DataNodes (dn1/dn2/dn3).
      2. bk1 is fully written to all three DataNodes; each DataNode successfully finalizes the block (FinalizeBlock), adds it to its pending incremental block report (IBR), and waits to report to the NameNode.
      3. The client commits bk1 after receiving the ACK.
      4. Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 enter Decommissioning.
      5. The DataNodes send their IBRs, but the block cannot be completed normally.
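
The five steps above can be replayed in a small toy model. This is a sketch only: `DecommissionScenario`, `NodeState`, `BlockState`, and `onIncrementalReport` are hypothetical names invented for illustration and are not Hadoop classes. It assumes that only replicas on in-service nodes count toward the minimum when an IBR arrives, which is the behaviour this report describes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every type here is hypothetical and is not a Hadoop class.
class DecommissionScenario {
    enum NodeState { IN_SERVICE, DECOMMISSIONING }
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static class Block {
        BlockState state = BlockState.UNDER_CONSTRUCTION;
        // State of each DataNode at the moment its report was processed.
        final List<NodeState> reportedReplicas = new ArrayList<>();
    }

    static final int MIN_REPLICATION = 1;

    // Step 3: the client commits the block after receiving the ACK.
    static void commit(Block b) {
        b.state = BlockState.COMMITTED;
    }

    // Step 5: an IBR arrives. Assumption: only in-service replicas are counted.
    static void onIncrementalReport(Block b, NodeState reporter) {
        b.reportedReplicas.add(reporter);
        long live = b.reportedReplicas.stream()
                .filter(s -> s == NodeState.IN_SERVICE)
                .count();
        if (b.state == BlockState.COMMITTED && live >= MIN_REPLICATION) {
            b.state = BlockState.COMPLETE;
        }
    }

    // Commits bk1, then processes three reports from nodes in the given state.
    static BlockState run(NodeState stateAtReportTime) {
        Block bk1 = new Block();
        commit(bk1);
        for (int i = 0; i < 3; i++) {
            onIncrementalReport(bk1, stateAtReportTime);
        }
        return bk1.state;
    }

    public static void main(String[] args) {
        System.out.println(run(NodeState.IN_SERVICE));      // COMPLETE
        System.out.println(run(NodeState.DECOMMISSIONING)); // COMMITTED
    }
}
```

In the second run all three replicas have been reported, yet the block stays COMMITTED because none of them is counted.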

       

      This leads to the following exception:

      Exception

      2019-04-02 13:40:31,882 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= minimum = 1) in file xxx
      2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - IPC Server handler 499 on 8020, call Call#122552 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
      org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet: xxx
      at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
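
When checking NameNode logs for this symptom, the telling detail is a block reported as COMMITTED but not COMPLETE even though `numNodes` already meets `minimum`. The following is a minimal sketch of extracting those fields. It assumes the log line has exactly the shape shown above; `StuckBlockLogParser` and `ParsedLine` are hypothetical helper names, not part of Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the log line quoted in this issue.
class StuckBlockLogParser {
    static final class ParsedLine {
        final String blockId;
        final int numNodes;
        final int minimum;

        ParsedLine(String blockId, int numNodes, int minimum) {
            this.blockId = blockId;
            this.numNodes = numNodes;
            this.minimum = minimum;
        }

        // Enough replicas were reported, yet the block was not completed.
        boolean enoughNodesButNotComplete() {
            return numNodes >= minimum;
        }
    }

    // Assumption: the message format matches the line in this report.
    private static final Pattern LINE = Pattern.compile(
            "BLOCK\\* (blk_\\d+_\\d+) is COMMITTED but not COMPLETE"
            + "\\(numNodes=\\s*(\\d+)\\s*\\S+\\s*minimum\\s*=\\s*(\\d+)\\)");

    // Returns null when the line does not contain the expected message.
    static ParsedLine parse(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new ParsedLine(m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)));
    }

    public static void main(String[] args) {
        String line = "2019-04-02 13:40:31,882 INFO namenode.FSNamesystem "
                + "(FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* "
                + "blk_4313483521_3245321090 is COMMITTED but not COMPLETE"
                + "(numNodes= 3 >= minimum = 1) in file xxx";
        ParsedLine p = parse(line);
        System.out.println(p.blockId + " numNodes=" + p.numNodes
                + " minimum=" + p.minimum
                + " suspicious=" + p.enoughNodesButNotComplete());
    }
}
```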

      This in turn causes the scenario described in HDFS-12747.

      The root cause is that addStoredBlock does not consider the case where the replicas are on decommissioning DataNodes.
      This problem needs to be fixed in the same way as HDFS-11499.
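
The difference between the two counting rules can be illustrated with a short sketch. It assumes the direction of the fix is to treat replicas on decommissioning nodes as usable when deciding whether a COMMITTED block may be completed; the attached patches may differ in detail. `UsableReplicaCount`, `Counts`, and both method names are hypothetical and are not the BlockManager API.

```java
// Hypothetical sketch; names are illustrative, not the BlockManager API.
class UsableReplicaCount {
    // Replica counts grouped by the state of the DataNode holding them.
    static final class Counts {
        final int live;
        final int decommissioning;

        Counts(int live, int decommissioning) {
            this.live = live;
            this.decommissioning = decommissioning;
        }
    }

    // Behaviour described as the root cause: decommissioning replicas are ignored.
    static boolean canCompleteLiveOnly(Counts c, int minReplication) {
        return c.live >= minReplication;
    }

    // Assumed fix direction: decommissioning replicas also count as usable.
    static boolean canCompleteWithDecommissioning(Counts c, int minReplication) {
        return c.live + c.decommissioning >= minReplication;
    }

    public static void main(String[] args) {
        // dn1/dn2/dn3 all entered Decommissioning before their IBRs arrived.
        Counts reported = new Counts(0, 3);
        System.out.println(canCompleteLiveOnly(reported, 1));            // false
        System.out.println(canCompleteWithDecommissioning(reported, 1)); // true
    }
}
```

With the second rule, the three finalized replicas from the scenario satisfy the minimum of 1 and the block can move to COMPLETE.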

      Attachments

        1. HDFS-14429.branch-2.02.patch
          7 kB
          Yicong Cai
        2. HDFS-14429.branch-2.01.patch
          6 kB
          Yicong Cai
        3. HDFS-14429.03.patch
          7 kB
          Yicong Cai
        4. HDFS-14429.02.patch
          7 kB
          Yicong Cai
        5. HDFS-14429.01.patch
          1 kB
          Yicong Cai


          People

            Assignee: Yicong Cai (caiyicong)
            Reporter: Yicong Cai (caiyicong)
            Votes: 1
            Watchers: 9
