HDFS-1260

0.20: Block lost when multiple DNs trying to recover it to different genstamps

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20-append
    • Fix Version/s: 0.20.205.0
    • Component/s: None
    • Labels: None

      Description

      Saw this issue on a cluster where some ops people were doing network changes without shutting down DNs first. So, recovery ended up getting started at multiple different DNs at the same time, and some race condition occurred that caused a block to get permanently stuck in recovery mode. What seems to have happened is the following:

      • FSDataset.tryUpdateBlock called with old genstamp 7091, new genstamp 7094, while the block in the volumeMap (and on filesystem) was genstamp 7093
      • we find the block file and meta file based on block ID only, without comparing gen stamp
      • we rename the meta file to the new genstamp _7094
      • in updateBlockMap, we do comparison in the volumeMap by oldblock without wildcard GS, so it does not update volumeMap
      • validateBlockMetadata now fails with "blk_7739687463244048122_7094 does not exist in blocks map"

      After this point, all future recovery attempts on that node fail in getBlockMetaDataInfo: getStoredBlock finds the _7094 genstamp (since the meta file was renamed above), and validateBlockMetadata then fails because _7094 isn't in the volumeMap.

      Making a unit test for this is probably going to be difficult, but doable.
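      To make the sequence concrete, here is a toy model of it. This sketch is illustrative only: the names are invented, and a plain map stands in for the volumeMap and the on-disk meta file name.

          import java.util.HashMap;
          import java.util.Map;

          /**
           * Toy reproduction of the ordering bug described above.
           * Illustrative only; this is not the actual FSDataset code.
           */
          public class TryUpdateBlockDemo {
            public static void main(String[] args) {
              // volumeMap keyed by "blockId_genstamp", as the lookups above effectively are
              Map<String, Boolean> volumeMap = new HashMap<>();
              volumeMap.put("blk_7739687463244048122_7093", true);

              // Stand-in for the genstamp encoded in the on-disk meta file name.
              String metaFileGS = "7093";

              long oldGS = 7091, newGS = 7094;  // recovery arrives holding a stale oldGS

              // 1+2. The meta file is found by block ID only (so the stale oldGS is
              //      never noticed) and renamed to the new genstamp BEFORE validation.
              metaFileGS = Long.toString(newGS);

              // 3. updateBlockMap keys on the exact old genstamp (no wildcard),
              //    misses, and silently leaves the map entry at _7093.
              volumeMap.remove("blk_7739687463244048122_" + oldGS);  // no-op

              // 4. Validation: disk says _7094, map says _7093 -> stuck forever.
              boolean consistent =
                  volumeMap.containsKey("blk_7739687463244048122_" + metaFileGS);
              System.out.println("consistent after update? " + consistent);  // false
            }
          }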

      Attachments

      1. hdfs-1260.txt (6 kB, Todd Lipcon)
      2. hdfs-1260.txt (7 kB, Todd Lipcon)
      3. simultaneous-recoveries.txt (465 kB, Todd Lipcon)
      4. HDFS-1260-20S.3.patch (7 kB, Jitendra Nath Pandey)


          Activity

          Todd Lipcon added a comment -

          To confirm the suspicion above, I had the operator rename the meta block back to _7094, and the next recovery attempt succeeded.

          Todd Lipcon added a comment -

          err... sorry... rename the meta block back to _7093

          Todd Lipcon added a comment -

          Simple patch which reorders tryUpdateBlock's verification vs rename.
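          For illustration, a toy sketch of that reordering, continuing the simplified model from the description (invented names, not the actual patch). The point is that validation against the block map happens before any on-disk rename, so a stale genstamp is rejected while disk and map still agree:

              import java.util.HashMap;
              import java.util.Map;

              /** Illustrative only: verify first, rename second. */
              public class ReorderedUpdateDemo {
                public static void main(String[] args) {
                  Map<String, Boolean> volumeMap = new HashMap<>();
                  volumeMap.put("blk_123_7093", true);

                  long oldGS = 7091;  // stale genstamp held by this recovery attempt

                  // Verify FIRST: if the caller's view of the block doesn't match
                  // the map, bail out before touching anything on disk.
                  if (!volumeMap.containsKey("blk_123_" + oldGS)) {
                    System.out.println("stale recovery (GS " + oldGS
                        + ") rejected; meta file left untouched");
                    return;  // disk and map remain consistent at _7093
                  }
                  // Only now would the meta file be renamed and the map updated.
                }
              }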

          sam rash added a comment -

          about the testing, any reason not to use one of the adapters instead of making this public?

          public long nextGenerationStampForBlock(Block block) throws IOException {

          sorry, i'm a stickler for visibility/encapsulation bits when i can be
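          For reference, a sketch of the adapter approach being suggested here. The class name and placement are hypothetical, not actual Hadoop code, and it assumes (per the quoted signature) that the method lives on FSNamesystem. The pattern: a small test-only class in the production package forwards to the package-private method, so the production visibility never has to widen.

              package org.apache.hadoop.hdfs.server.namenode;

              import java.io.IOException;

              import org.apache.hadoop.hdfs.protocol.Block;

              /**
               * Hypothetical test adapter: lives in the production package, so it
               * can reach package-private members on behalf of tests.
               */
              public class FSNamesystemTestAdapter {
                public static long nextGenerationStampForBlock(FSNamesystem ns, Block block)
                    throws IOException {
                  // The production method can stay package-private.
                  return ns.nextGenerationStampForBlock(block);
                }
              }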

          Nicolas Spiegelberg added a comment -

          +1 on the fix. good catch

          sam rash added a comment -

          oh, other than that, lgtm

          Todd Lipcon added a comment -

          Here's a patch that moves the call over to the adapter. Also added a bit of javadoc to DelayAnswer.
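          For readers who haven't seen it, a minimal sketch of a DelayAnswer-style helper. This is illustrative, not the actual class from the HDFS test code: a Mockito Answer that parks the stubbed call until the test releases it, which is what makes race interleavings like the one in this bug reproducible.

              import java.util.concurrent.CountDownLatch;

              import org.mockito.invocation.InvocationOnMock;
              import org.mockito.stubbing.Answer;

              /** Illustrative DelayAnswer-style helper for forcing interleavings. */
              class DelayAnswer implements Answer<Object> {
                private final CountDownLatch fireLatch = new CountDownLatch(1);  // call arrived
                private final CountDownLatch waitLatch = new CountDownLatch(1);  // go-signal

                @Override
                public Object answer(InvocationOnMock invocation) throws Throwable {
                  fireLatch.countDown();   // tell the test the call has started
                  waitLatch.await();       // park until the test says proceed
                  return invocation.callRealMethod();
                }

                void waitForCall() throws InterruptedException { fireLatch.await(); }
                void proceed() { waitLatch.countDown(); }
              }

          A test would stub the target method on a spied object with Mockito's doAnswer(...), call waitForCall() to line both racers up, then proceed() to release them in the order under test.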

          sam rash added a comment -

          yea, looks good. at some point, does it make sense to move the DelayAnswer class out? it seems generally useful (not this patch, but just thinking)

          Todd Lipcon added a comment -

          Yea, we could move it to a MockitoUtil class or something. Let's tackle that when we move all these tests forward to trunk (I plan to do that in July hopefully)

          dhruba borthakur added a comment -

          This patch looks perfect to me. +1

          Todd Lipcon added a comment -

          After months of running this test I ran into this failure attached above. One of the DNs somehow ends up with multiple meta files for the same block, but at different generation stamps.

          I think the issue is in the implementation of DataNode.updateBlock(). The block passed in doesn't have a wildcard generation stamp, but we don't care - we go and find the block on disk without matching generation stamps. I think this is OK based on the validation logic - we still only move blocks forward in GS-time, and don't revert length. However, when we then call updateBlockMap() it doesn't use a wildcard generation stamp, so the block can get left in the block map with the old generation stamp. This inconsistency I think cascades into the sort of failure seen in the attached log.

          I think the solution is:

          • Change updateBlock to call updateBlockMap with a wildcard generation stamp key
          • Change the interruption code to use a wildcard GS block when interrupting concurrent writers

          I will make these changes and see if the rest of the unit tests still pass, then see if I can come up with a regression test.
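          A toy sketch of the wildcard-genstamp fix described above (invented names, same simplified model as in the description). Keying the map update on the block ID alone means a stale genstamp on the incoming block can no longer strand the entry, while the forward-only check still prevents reverting:

              import java.util.HashMap;
              import java.util.Map;

              /** Illustrative only: wildcard-GS update keyed on block ID alone. */
              public class WildcardUpdateDemo {
                public static void main(String[] args) {
                  Map<Long, Long> volumeMap = new HashMap<>();  // blockId -> genstamp
                  long blockId = 7739687463244048122L;
                  volumeMap.put(blockId, 7093L);

                  long oldGS = 7091, newGS = 7094;  // recovery holds a stale oldGS

                  // Wildcard lookup: match on block ID only, ignore the stale oldGS,
                  // and only ever move the genstamp forward.
                  Long current = volumeMap.get(blockId);
                  if (current != null && current < newGS) {
                    volumeMap.put(blockId, newGS);
                  }
                  System.out.println("map genstamp now: " + volumeMap.get(blockId));  // 7094
                }
              }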

          dhruba borthakur added a comment -

          > However, when we then call updateBlockMap() it doesn't use a wildcard generation stamp,

          yes, that seems to be a bug to me. Great catch!

          Jitendra Nath Pandey added a comment -

          Patch ported to 20-security.

          Suresh Srinivas added a comment -

          +1 for the patch.

          Todd Lipcon added a comment -

          This was committed to 0.20.205, resolving this JIRA.

          Matt Foley added a comment -

          Todd, do we need this in trunk also?

          Todd Lipcon added a comment -

          I don't believe so - I think the new append design in trunk prevents this issue. There is an existing open JIRA against trunk about forward-porting all append-related test cases to the new trunk implementation to be sure the new design doesn't suffer from the same issues.

          Matt Foley added a comment -

          Closed upon release of 0.20.205.0


            People

            • Assignee: Todd Lipcon
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 6
