[HDFS-14941] Potential editlog race condition can cause corrupted file - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.3.0, 3.2.2, 2.10.1
Component/s: namenode
Labels: ha
Hadoop Flags: Reviewed

Description

Recently we encountered an issue where, after a failover, the NameNode complained about corrupted files/missing blocks. The blocks did recover after full block reports, so they were not actually missing. After further investigation, we believe this is what happened:

First of all, the SbN may receive a block report before the corresponding edits have been tailed. In that case the SbN postpones processing the DN block report, which is handled by the guarding logic below:

      if (shouldPostponeBlocksFromFuture &&
          namesystem.isGenStampInFuture(iblk)) {
        queueReportedBlock(storageInfo, iblk, reportedState,
            QUEUE_REASON_FUTURE_GENSTAMP);
        continue;
      }

Basically, if a reported block has a future generation stamp, the DN report gets requeued.
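To make the guard concrete, here is a minimal toy sketch of the postponement idea in Java. It is not HDFS code: the class `PostponedReportSketch`, the `Report` type, and the `reprocessPostponed()` drain step are hypothetical names introduced for illustration, and when and how the real NameNode drains its queue is not modeled.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the "postpone blocks from the future" guard; not HDFS code. */
class PostponedReportSketch {
    /** Hypothetical stand-in for a reported replica: block id plus generation stamp. */
    static final class Report {
        final long blockId;
        final long genStamp;

        Report(long blockId, long genStamp) {
            this.blockId = blockId;
            this.genStamp = genStamp;
        }
    }

    private long currentGenStamp;                  // latest genstamp known from tailed edits
    private final Queue<Report> postponed = new ArrayDeque<>();
    private int processed = 0;

    PostponedReportSketch(long initialGenStamp) {
        this.currentGenStamp = initialGenStamp;
    }

    /** A report is "from the future" if its genstamp is ahead of what the edits have shown. */
    boolean isGenStampInFuture(Report r) {
        return r.genStamp > currentGenStamp;
    }

    /** Mirrors the guard: queue future reports, process the rest. */
    void report(Report r) {
        if (isGenStampInFuture(r)) {
            postponed.add(r);
            return;
        }
        processed++;
    }

    /** Simulates edit tailing advancing the generation stamp. */
    void setGenStamp(long genStamp) {
        currentGenStamp = genStamp;
    }

    /** Hypothetical drain step: retry every queued report once. */
    void reprocessPostponed() {
        int n = postponed.size();
        for (int i = 0; i < n; i++) {
            report(postponed.remove());
        }
    }

    int postponedCount() { return postponed.size(); }

    int processedCount() { return processed; }

    public static void main(String[] args) {
        PostponedReportSketch sbn = new PostponedReportSketch(100L);
        sbn.report(new Report(1L, 101L));          // ahead of the standby: queued
        System.out.println("postponed=" + sbn.postponedCount());
        sbn.setGenStamp(101L);                     // edits catch up
        sbn.reprocessPostponed();
        System.out.println("processed=" + sbn.processedCount());
    }
}
```

Note that the only thing the guard inspects in this model is the generation stamp; it says nothing about whether the block itself is already known.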

However, in FSNamesystem#storeAllocatedBlock, we have the following code:

      // allocate new block, record block locations in INode.
      newBlock = createNewBlock();
      INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
      saveAllocatedBlock(src, inodesInPath, newBlock, targets);

      persistNewBlock(src, pendingFile);
      offset = pendingFile.computeFileSize();

The line

      newBlock = createNewBlock();

would log an edit entry OP_SET_GENSTAMP_V2 to bump the generation stamp on the Standby, while the following line

      persistNewBlock(src, pendingFile);

would log another edit entry OP_ADD_BLOCK to actually add the block on the Standby.

The race condition is then as follows: imagine the Standby has just processed OP_SET_GENSTAMP_V2 but not yet OP_ADD_BLOCK (if the two happen to be in different segments). Now a block report with the new generation stamp comes in.

Since the genstamp bump has already been processed, the reported block is no longer considered a future block, so the guarding logic passes. But the block has not actually been added to the block map, because the second edit is yet to be tailed. So the block gets added to the invalidate block list, and we saw messages like:

BLOCK* addBlock: block XXX on node XXX size XXX does not belong to any file

Even worse, since this IBR is effectively lost, the NameNode has no information about this block until the next full block report. So after a failover, the NN marks it as corrupt.

This issue won't happen, though, if both edit entries are tailed together, so that no IBR processing can happen in between. But in our case, we set the edit tailing interval very low (to allow Standby reads), so under high workload there is a much higher chance that the two entries are tailed separately, causing the issue.
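The interleavings described above can be sketched with a small toy model. This is only an illustration under simplifying assumptions, not the NameNode implementation: `StandbyRaceSketch`, `Outcome`, and the method names are hypothetical, the block map is reduced to a set of block ids, and the two edit ops are modeled as plain method calls.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a standby applying edits while IBRs arrive; not HDFS code. */
class StandbyRaceSketch {
    enum Outcome { POSTPONED, STORED, INVALIDATED }

    private long genStamp;                                   // bumped by OP_SET_GENSTAMP_V2
    private final Set<Long> blockMap = new HashSet<>();      // filled by OP_ADD_BLOCK
    private final List<Long> invalidated = new ArrayList<>();

    StandbyRaceSketch(long initialGenStamp) {
        this.genStamp = initialGenStamp;
    }

    /** Models tailing OP_SET_GENSTAMP_V2. */
    void applySetGenStamp(long newGenStamp) {
        genStamp = newGenStamp;
    }

    /** Models tailing OP_ADD_BLOCK. */
    void applyAddBlock(long blockId) {
        blockMap.add(blockId);
    }

    /** Models IBR handling: the guard checks only the genstamp, then the block map. */
    Outcome processIbr(long blockId, long reportedGenStamp) {
        if (reportedGenStamp > genStamp) {
            return Outcome.POSTPONED;                        // future genstamp: requeued
        }
        if (!blockMap.contains(blockId)) {
            invalidated.add(blockId);                        // "does not belong to any file"
            return Outcome.INVALIDATED;
        }
        return Outcome.STORED;
    }

    int invalidatedCount() { return invalidated.size(); }

    public static void main(String[] args) {
        // Case 1: IBR arrives between the two edits (tailed separately).
        StandbyRaceSketch racy = new StandbyRaceSketch(100L);
        racy.applySetGenStamp(101L);
        System.out.println("between edits: " + racy.processIbr(7L, 101L));

        // Case 2: both edits are applied before any IBR is processed.
        StandbyRaceSketch safe = new StandbyRaceSketch(100L);
        safe.applySetGenStamp(101L);
        safe.applyAddBlock(7L);
        System.out.println("after both edits: " + safe.processIbr(7L, 101L));

        // Case 3: IBR arrives before either edit; the guard postpones it.
        StandbyRaceSketch early = new StandbyRaceSketch(100L);
        System.out.println("before both edits: " + early.processIbr(7L, 101L));
    }
}
```

In this model only the first case loses the report: the genstamp check passes, the block map lookup fails, and the block lands on the invalidate list. The other two orderings end in STORED or POSTPONED.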

Attachments

- HDFS-14941.001.patch, 03/Nov/19 22:55, 14 kB, Konstantin Shvachko
- HDFS-14941.002.patch, 04/Nov/19 22:39, 17 kB, Chen Liang
- HDFS-14941.003.patch, 05/Nov/19 01:33, 18 kB, Chen Liang
- HDFS-14941.004.patch, 05/Nov/19 18:02, 18 kB, Chen Liang
- HDFS-14941.005.patch, 05/Nov/19 21:55, 18 kB, Chen Liang
- HDFS-14941.006.patch, 05/Nov/19 23:56, 19 kB, Chen Liang

Issue Links

- breaks: HDFS-15421 IBR leak causes standby NN to be stuck in safe mode (Resolved)
- contains: HDFS-14792 [SBN read] StanbyNode does not come out of safemode while adding new blocks. (Resolved)
- relates to: HDFS-17453 IncrementalBlockReport can have race condition with Edit Log Tailer (Resolved)

People

Assignee: Chen Liang
Reporter: Chen Liang
Votes: 0
Watchers: 13

Dates

Created: 29/Oct/19 23:38
Updated: 04/Apr/24 19:39
Resolved: 06/Nov/19 17:57