Hadoop HDFS / HDFS-14429

Block remain in COMMITTED but not COMPLETE caused by Decommission

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.2
    • Fix Version/s: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
    • Component/s: None
    • Labels: None

    Description

      In the following scenario, a block stays in the COMMITTED state, never reaches COMPLETE, and cannot be closed properly:

      1. The client writes block bk1 to three DataNodes (dn1/dn2/dn3).
      2. bk1 is fully written to all three DataNodes. Each DataNode finalizes the block successfully (FinalizeBlock), adds it to its pending incremental block report (IBR), and waits to report it to the NameNode.
      3. The client commits bk1 after receiving the ACK.
      4. Before the DataNodes have sent the IBR, all three nodes dn1/dn2/dn3 enter Decommissioning.
      5. The DataNodes send the IBR, but the block cannot be completed normally.
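
      The ordering in steps 3 to 5 can be illustrated with a small toy model. This is only a sketch under stated assumptions: every class, method, and field name below is hypothetical and not taken from the Hadoop code base, and it assumes the NameNode completes a COMMITTED block only once enough reported replicas sit on in-service nodes.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: all names here are hypothetical, not Hadoop APIs.
class CommittedBlockScenario {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }
    enum NodeState { IN_SERVICE, DECOMMISSIONING }

    // Assumed minimum replication needed to complete a block.
    static final int MIN_REPLICATION = 1;

    private final Map<String, NodeState> nodes = new LinkedHashMap<>();
    private final Set<String> reportedReplicas = new LinkedHashSet<>();
    private BlockState state = BlockState.UNDER_CONSTRUCTION;

    CommittedBlockScenario(String... dataNodes) {
        for (String dn : dataNodes) {
            nodes.put(dn, NodeState.IN_SERVICE);
        }
    }

    // Step 3: the client commits the block.
    void clientCommit() {
        if (state == BlockState.UNDER_CONSTRUCTION) {
            state = BlockState.COMMITTED;
        }
        tryComplete();
    }

    // Step 4: a node enters Decommissioning.
    void startDecommission(String dn) {
        nodes.put(dn, NodeState.DECOMMISSIONING);
    }

    // Step 5: the NameNode receives the node's incremental block report.
    void receiveIbr(String dn) {
        reportedReplicas.add(dn);
        tryComplete();
    }

    // Counts only replicas reported by in-service nodes (the assumed buggy policy).
    private int liveReplicas() {
        int live = 0;
        for (String dn : reportedReplicas) {
            if (nodes.get(dn) == NodeState.IN_SERVICE) {
                live++;
            }
        }
        return live;
    }

    private void tryComplete() {
        if (state == BlockState.COMMITTED && liveReplicas() >= MIN_REPLICATION) {
            state = BlockState.COMPLETE;
        }
    }

    BlockState state() {
        return state;
    }

    // Replays the scenario; the flag controls whether step 4 happens before step 5.
    static BlockState run(boolean decommissionBeforeIbr) {
        String[] dns = {"dn1", "dn2", "dn3"};
        CommittedBlockScenario nn = new CommittedBlockScenario(dns);
        nn.clientCommit();
        if (decommissionBeforeIbr) {
            for (String dn : dns) {
                nn.startDecommission(dn);
            }
        }
        for (String dn : dns) {
            nn.receiveIbr(dn);
        }
        return nn.state();
    }

    public static void main(String[] args) {
        System.out.println("IBR before decommission: " + run(false));
        System.out.println("Decommission before IBR: " + run(true));
    }
}
```

      In this model the block reaches COMPLETE when the IBRs arrive while the nodes are still in service, and stays COMMITTED when all three nodes start decommissioning first.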

       

      This leads to the following exceptions:

      Exception

      2019-04-02 13:40:31,882 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= minimum = 1) in file xxx
      2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - IPC Server handler 499 on 8020, call Call#122552 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
      org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet: xxx
      at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)

      This also causes the scenario described in HDFS-12747.

      The root cause is that addStoredBlock does not consider the case where the replicas are on DataNodes that are decommissioning.
      This problem should be fixed in the same way as HDFS-11499.
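
      As a rough illustration of the root cause, the sketch below contrasts two ways of checking the minimum replication. It is inferred from the description only and is not copied from any patch: the class and method names are hypothetical, and the assumption is that the fix lets replicas on decommissioning nodes count toward the minimum.

```java
// Hypothetical helper, not the Hadoop API: it only contrasts two counting policies.
class MinReplicationCheck {

    // Assumed current behavior: only replicas on in-service nodes count.
    static boolean liveOnly(int live, int decommissioning, int minimum) {
        return live >= minimum;
    }

    // Assumed fixed behavior: replicas on decommissioning nodes also count.
    static boolean liveOrDecommissioning(int live, int decommissioning, int minimum) {
        return live + decommissioning >= minimum;
    }

    public static void main(String[] args) {
        // Values matching the scenario: three replicas, all on decommissioning nodes.
        int live = 0;
        int decommissioning = 3;
        int minimum = 1;
        System.out.println("live only: "
            + liveOnly(live, decommissioning, minimum));
        System.out.println("live or decommissioning: "
            + liveOrDecommissioning(live, decommissioning, minimum));
    }
}
```

      With zero live replicas and three decommissioning ones, the first check fails and the second passes, which matches the log line reporting numNodes = 3 >= minimum = 1 for a block that is still not COMPLETE.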

      Attachments

        1. HDFS-14429.01.patch (1 kB, Yicong Cai)
        2. HDFS-14429.02.patch (7 kB, Yicong Cai)
        3. HDFS-14429.branch-2.01.patch (6 kB, Yicong Cai)
        4. HDFS-14429.branch-2.02.patch (7 kB, Yicong Cai)
        5. HDFS-14429.03.patch (7 kB, Yicong Cai)

            People

              Assignee: Yicong Cai (caiyicong)
              Reporter: Yicong Cai (caiyicong)
              Votes: 1
              Watchers: 9