Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Cannot Reproduce
Affects Version/s: 1.5.0
Fix Version/s: None
Component/s: None
Environment: testing 1.5.1rc1 on a 10-node cluster, hadoop 2.2.0, zk 3.4.5
Description
I was running Accumulo 1.5.1rc1 on a 10-node cluster. After two days, I saw that several tablet servers had died with an OutOfMemoryError (OOME), and several hundred tablets were offline.
The master was attempting to recover the write lease on the dead servers' write-ahead log (WAL) files, and this was failing.
Attempts to examine the log file failed:
$ hadoop fs -cat /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
Cannot obtain block length for LocatedBlock{BP-901421341-192.168.1.3-1389719663617:blk_1076582460_2869891; getBlockSize()=0; corrupt=false; offset=0; locs=[192.168.1.5:50010]}
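For diagnosis, fsck should show the file still stuck open for write (a sketch of the check, not output captured from my run):

$ hdfs fsck /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14 -files -blocks -locations
$ # list any files under the WAL directory that still hold a write lease
$ hdfs fsck /accumulo/wal -openforwrite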
Looking at the DN logs, I see this:
2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at host2/192.168.1.3:9000 calls recoverBlock(BP-901421341-192.168.1.3-1389719663617:blk_1076582290_2869721, targets=[192.168.1.5:50010], newGenerationStamp=2880680)
2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680, replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
  getNumBytes()     = 634417185
  getBytesOnDisk()  = 634417113
  getVisibleLength()= 634417113
  getVolume()       = /srv/hdfs4/hadoop/dn/current
  getBlockFile()    = /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
  bytesAcked=634417113
  bytesOnDisk=634417113
I'm guessing that the /srv/hdfs4 partition filled up, and that the disagreement between the length the DN thinks the block should be (getNumBytes() = 634417185) and what is actually on disk (bytesOnDisk = 634417113) is causing the recovery failures.
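A quick way to check that theory on the affected node (the datanode log location is a guess; it depends on the installation):

$ df -h /srv/hdfs4
$ grep -i "No space left on device" /var/log/hadoop/*datanode*.log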
Restarting HDFS made no difference.
To make any progress, I manually copied the block file from the datanode's local disk back up into HDFS in place of the WAL.
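The commands were roughly the following, run on the datanode holding the replica (reconstructed rather than a verbatim record; <wal-id> and blk_NNNNNNNNNN stand in for the actual WAL file name and block file):

$ # remove the WAL entry whose block length HDFS cannot determine
$ hadoop fs -rm /accumulo/wal/192.168.1.5+9997/<wal-id>
$ # upload the raw replica file from the datanode's rbw directory in its place
$ hadoop fs -put /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_NNNNNNNNNN \
    /accumulo/wal/192.168.1.5+9997/<wal-id>

Once re-uploaded, the file is an ordinary closed HDFS file, so the "Cannot obtain block length" error no longer applies and recovery can proceed.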