We hit this on one of our test clusters under high load.
The error occurred on an HBase RegionServer WAL file.
1. In HBase, multiple threads write, sync and close the same WAL file.
2. A single writer thread writes the entries, multiple syncer threads call hsync on the same stream, and a roller thread rolls the WAL at regular intervals, i.e. closes the current WAL file and opens a new one for subsequent entries.
3. During close() by the roller, the last block was committed with a smaller length than the replicas on all DNs actually had.
4. All IBRs reported by the DNs carried a greater length than the length COMMITTED by the client, so all those replicas were marked as CORRUPT.
5. We use IBR batching with dfs.namenode.file.close.num-committed-allowed=1, so the client (HBase RS) did not see any problem: the file was closed successfully without waiting for the correct IBR for the last block.
HDFS-9289 safeguarded the reassignment of DataStreamer#block during pipeline update by making it volatile, but it did not protect the contents of the block.
The suspected problem is:
1. ResponseProcessor updates the block size after every ACK by calling ExtendedBlock.setNumBytes(), which in turn updates numBytes of the internal block. This update is not thread-safe.
2. LogRoller calls close(), which passes DataStreamer#block as the last block. Our guess is that at this point ExtendedBlock.getNumBytes() does not return the latest value written by ResponseProcessor but an earlier one, because ExtendedBlock and its internal block are not thread-safe.
With this smaller size, the block gets COMMITTED at the NameNode and all IBRs are marked as CORRUPT.
Proposed fix: make ExtendedBlock thread-safe for setNumBytes() and getNumBytes().
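A minimal sketch of the kind of change meant here, assuming a volatile length field is enough to make the getter see the latest value. BlockLengthSketch and simulate() are hypothetical names used only for illustration. This is not the real ExtendedBlock code, and it does not reproduce the failure.

```java
/**
 * Hypothetical, simplified stand-in for a block whose length is written by
 * one thread (the ACK processor) and read by another (the closing thread).
 * This is NOT the actual Hadoop ExtendedBlock class.
 */
final class BlockLengthSketch {
    // volatile makes every write visible to later reads from other threads,
    // and makes reads/writes of the 64-bit long atomic.
    private volatile long numBytes;

    BlockLengthSketch(long initial) {
        this.numBytes = initial;
    }

    void setNumBytes(long len) {
        numBytes = len;
    }

    long getNumBytes() {
        return numBytes;
    }

    /**
     * An "acker" thread publishes a growing length once per simulated ACK;
     * the caller then reads the length it would commit on close.
     * Note: join() alone already makes the final value visible in this demo.
     * The volatile field matters when the reader runs without such a barrier.
     */
    static long simulate(int acks, long bytesPerAck) {
        BlockLengthSketch block = new BlockLengthSketch(0);
        Thread acker = new Thread(() -> {
            for (int i = 1; i <= acks; i++) {
                block.setNumBytes(i * bytesPerAck);
            }
        }, "response-processor-sketch");
        acker.start();
        try {
            acker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for acks", e);
        }
        return block.getNumBytes();
    }

    public static void main(String[] args) {
        System.out.println("committed length = " + simulate(1000, 512L));
    }
}
```

Whether volatile, synchronized accessors, or an AtomicLong is the right choice for the real class would need to be decided against the actual code and its other fields.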
If the above analysis makes sense, we can raise a Jira and contribute the fix.
We have hit this issue three times on a 40-core/380GB-RAM machine. We are trying to reproduce it again with more logging, but no luck so far.
It was reproduced once with DEBUG logs enabled, which confirmed that the complete() call is sent only after all ACKs are received. However, the DEBUG logs had no information about the numBytes sent during complete(), so we could not verify that this would be the fix.