[HDFS-7443] Datanode upgrade to BLOCKID_BASED_LAYOUT fails if duplicate block files are present in the same volume - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.6.0
Fix Version/s: 2.6.1, 3.0.0-alpha1
Component/s: None
Labels: 2.6.1-candidate
Target Version/s: 2.6.1

Description

When we upgraded a medium-sized cluster from 2.5 to 2.6, about 4% of the datanodes did not come up. They tried the data file layout upgrade to BLOCKID_BASED_LAYOUT, introduced in HDFS-6482, but failed.

All failures were caused by NativeIO.link() throwing an IOException reporting EEXIST. The datanodes didn't die right away; instead, the upgrade was retried each time the block pool initialization was retried, which happened whenever BPServiceActor tried to register with the namenode. After many retries, the datanodes terminated. This left previous.tmp, and a current with no VERSION file, in the block pool slice storage directory.
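The EEXIST failure can be reproduced outside Hadoop with plain hard links. The following is a minimal sketch, not the datanode's upgrade code: it assumes, for illustration only, that the target path of each link depends solely on the block file's name, so two files with the same name in different old subdirectories collide. The directory names, the block file name, and the helper `link_blocks_into_dir` are hypothetical.

```python
import errno
import os
import tempfile


def link_blocks_into_dir(src_files, dst_dir):
    """Hard-link each source file into dst_dir under its own base name.

    Returns a list of (source path, errno) pairs for links that failed.
    """
    failures = []
    for src in src_files:
        dst = os.path.join(dst_dir, os.path.basename(src))
        try:
            os.link(src, dst)
        except OSError as e:
            failures.append((src, e.errno))
    return failures


def demo_duplicate_link():
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical old layout: the same block file name in two subdirs.
        old_dirs = [os.path.join(root, "current", name)
                    for name in ("subdir0", "subdir1")]
        new_dir = os.path.join(root, "upgraded")
        for d in old_dirs + [new_dir]:
            os.makedirs(d)

        sources = []
        for d in old_dirs:
            path = os.path.join(d, "blk_1073741825")
            with open(path, "w") as f:
                f.write("data")
            sources.append(path)

        failures = link_blocks_into_dir(sources, new_dir)
        # Report which old subdir failed and the symbolic errno name.
        return [(os.path.basename(os.path.dirname(src)), errno.errorcode[code])
                for src, code in failures]
```

In this sketch the first link succeeds and the second one fails because the destination name is already taken, which matches the symptom reported above.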

Although previous.tmp contained the old VERSION file, the content was in the new layout and the subdirs were all newly created ones. This shouldn't have happened because the upgrade-recovery logic in Storage removes current and renames previous.tmp to current before retrying. All successfully upgraded volumes had old state preserved in their previous directory.
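The recovery behavior this paragraph expects (remove current, then rename previous.tmp back to current) can be modeled as follows. This is a simplified sketch based only on the description above, not the actual logic in Storage; the function name `recover_interrupted_upgrade` and the marker file used in the demo are hypothetical.

```python
import os
import shutil
import tempfile


def recover_interrupted_upgrade(storage_dir):
    """Roll back a half-finished upgrade before it is retried.

    If previous.tmp exists, discard the partially built current directory
    and move previous.tmp back into place. Returns True if a rollback
    happened, False if there was nothing to recover.
    """
    current = os.path.join(storage_dir, "current")
    previous_tmp = os.path.join(storage_dir, "previous.tmp")
    if not os.path.isdir(previous_tmp):
        return False
    if os.path.isdir(current):
        shutil.rmtree(current)
    os.rename(previous_tmp, current)
    return True


def demo_recovery():
    with tempfile.TemporaryDirectory() as storage_dir:
        # State after a failed attempt: the old data sits in previous.tmp,
        # and current holds a partial new layout without a VERSION file.
        previous_tmp = os.path.join(storage_dir, "previous.tmp")
        current = os.path.join(storage_dir, "current")
        os.makedirs(previous_tmp)
        os.makedirs(os.path.join(current, "partial"))
        with open(os.path.join(previous_tmp, "VERSION"), "w") as f:
            f.write("old layout")

        recovered = recover_interrupted_upgrade(storage_dir)
        return (recovered,
                sorted(os.listdir(storage_dir)),
                sorted(os.listdir(current)))
```

If recovery runs as modeled here, previous.tmp can never end up holding new-layout content, which is why the observed state was unexpected.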

In summary, there were two observed issues:

- The upgrade failed because link() failed with EEXIST.
- previous.tmp contained not the content of the original current, but a half-upgraded one.
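To check whether a volume is exposed to the first issue, one could scan it for block file names that appear more than once. The sketch below is a hypothetical diagnostic, not part of the attached patches; it assumes that block and meta files can be recognized by a `blk_` name prefix, and the function name `find_duplicate_block_files` is made up for illustration.

```python
import os
import tempfile
from collections import defaultdict


def find_duplicate_block_files(volume_dir):
    """Map each blk_* file name that occurs more than once to its paths."""
    paths_by_name = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(volume_dir):
        for name in filenames:
            if name.startswith("blk_"):
                paths_by_name[name].append(os.path.join(dirpath, name))
    return {name: sorted(paths)
            for name, paths in paths_by_name.items()
            if len(paths) > 1}


def demo_scan():
    with tempfile.TemporaryDirectory() as volume:
        # Hypothetical layout: blk_1 is duplicated, blk_2 is unique.
        layout = {"subdir0": ["blk_1", "blk_2"], "subdir1": ["blk_1"]}
        for subdir, names in layout.items():
            os.makedirs(os.path.join(volume, subdir))
            for name in names:
                with open(os.path.join(volume, subdir, name), "w") as f:
                    f.write("x")
        duplicates = find_duplicate_block_files(volume)
        # Return the duplicated names and how many copies of each exist.
        return {name: len(paths) for name, paths in duplicates.items()}
```

Running such a scan before the upgrade only reports candidates; deciding which copy to keep is a separate question that this sketch does not address.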

We did not see this in smaller scale test clusters.

Attachments

- HDFS-7443.001.patch (18/Dec/14 22:00, 1.57 MB, Colin McCabe)
- HDFS-7443.002.patch (19/Dec/14 04:42, 1.57 MB, Colin McCabe)

Issue Links

is related to

HADOOP-11483 HardLink.java should use the jdk7 createLink method

Closed

People

Assignee: Colin McCabe

Reporter: Kihwal Lee

Votes: 0

Watchers: 13

Dates

Created: 25/Nov/14 14:17

Updated: 30/Aug/16 01:40

Resolved: 19/Dec/14 21:26
