Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-8406

Lease recovery continually failed

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: None

      Description

      While testing Accumulo on a cluster and killing processes, I ran into a situation where the lease on an accumulo write ahead log in HDFS could not be recovered. Even restarting HDFS and Accumulo would not fix the problem.

      The following message was seen in an Accumulo tablet server log immediately before the tablet server was killed.

      2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream ResponseProcessor exception  for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
      java.io.IOException: Bad response ERROR for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 from datanode 10.1.5.9:50010
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
      2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 in pipeline 10.1.5.55:50010, 10.1.5.9:5
      

      Before recovering data from a write ahead log, the Accumulo master attempts to recover the lease. This repeatedly failed with messages like the following.

      2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering lease on hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
      org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
      

      Below is some info from the NN logs for the problematic file.

      [ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245 hadoop-ec2-user-namenode-leader2.log 
      2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2. BP-802741494-10.1.5.6-1431557089849 blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
      2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_-1128465883_16
      2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.  Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 from client DFSClient_NONMAPREDUCE_-1128465883_16
      2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
      2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been closed. Lease recovery is in progress. RecoveryId = 192257 for block blk_1073932823_192060{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
      java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
      2015-05-14 17:14:54,292 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 10.1.5.158:53784 Call#529 Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
      2015-05-14 17:14:54,319 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.append: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
      2015-05-14 17:14:54,320 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 10.1.5.158:53784 Call#530 Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
      

      Below is expanded info from the NN logs about the exception mentioned above.

      2015-05-14 17:14:51,704 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 10000, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.1.5.45:35687 Call#66737 Retry#0
      
      java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
              at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214)
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4663)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4878)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4842)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:734)
              at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270)
              at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26394)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
      

      Below is the output of running fsck on the file.

      [ec2-user@leader2 sbin]$ hdfs fsck /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 -files -blocks -locations -openforwrite
      Connecting to namenode via http://leader2:50070
      FSCK started by ec2-user (auth:SIMPLE) from /10.1.5.6 for path /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 at Thu May 14 19:03:21 UTC 2015
      /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 541965482 bytes, 1 block(s), OPENFORWRITE:  Under replicated BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}. Target Replicas is 3 but found 2 replica(s).
      0. BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]} len=541965482 repl=2 [10.1.5.55:50010, 10.1.5.45:50010, 10.1.5.9:50010, 10.1.5.131:50010]
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                kturner Keith Turner
              • Votes:
                0 Vote for this issue
                Watchers:
                24 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: