[HDFS-11160] VolumeScanner reports write-in-progress replicas as corrupt incorrectly - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha2
Component/s: datanode
Labels: None
Environment: CDH5.7.4
Target Version/s: 2.7.4
Hadoop Flags: Reviewed
Release Note:
Fixed a race condition that caused VolumeScanner to recognize a good replica as a bad one if the replica is also being written concurrently.

Description

Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously report good replicas as corrupt. This is serious because it can lead to data loss if all replicas of a block are declared corrupt. The bug is especially prominent when there are many append requests via HttpFs/WebHDFS.

We are investigating an incident that caused a very high block corruption rate in a relatively small cluster. Initially, we thought HDFS-11056 was to blame. However, after applying the HDFS-11056 fix, we still see VolumeScanner reporting corrupt replicas.

It turns out that if a replica is being appended to while VolumeScanner is scanning it, VolumeScanner may compare the new checksum against old data, causing a checksum mismatch.

I have a unit test that reproduces the error and will attach it later. A quick and simple fix is to hold the FsDatasetImpl lock while reading the checksum from disk.
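To illustrate the interleaving described above, here is a minimal toy sketch. It is not the actual VolumeScanner or FsDatasetImpl code, and not the attached patch: the names `ReplicaChecksumSketch`, `Replica`, `Snapshot`, `racyScanLooksCorrupt` and `consistentScanLooksCorrupt` are hypothetical, the replica lives in memory rather than on disk, and a plain Java monitor stands in for the FsDatasetImpl lock. It only shows why reading data and checksum at different times produces a false mismatch, and why reading both under one lock hold does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Toy model of the race; NOT the real VolumeScanner or FsDatasetImpl code. */
class ReplicaChecksumSketch {

    /** Hypothetical replica: block data plus a stored checksum over that data. */
    static final class Replica {
        private final Object lock = new Object(); // stand-in for the FsDatasetImpl lock
        private byte[] data = new byte[0];
        private long checksum = crc(data);

        /** Writer path: grow the data and update the stored checksum together. */
        void append(byte[] more) {
            synchronized (lock) {
                byte[] grown = Arrays.copyOf(data, data.length + more.length);
                System.arraycopy(more, 0, grown, data.length, more.length);
                data = grown;
                checksum = crc(grown);
            }
        }

        byte[] readData() {
            synchronized (lock) { return data.clone(); }
        }

        long readChecksum() {
            synchronized (lock) { return checksum; }
        }

        /** Reads data and checksum as one consistent pair under a single lock hold. */
        Snapshot snapshot() {
            synchronized (lock) { return new Snapshot(data.clone(), checksum); }
        }
    }

    static final class Snapshot {
        final byte[] data;
        final long checksum;
        Snapshot(byte[] data, long checksum) { this.data = data; this.checksum = checksum; }
    }

    static long crc(byte[] bytes) {
        CRC32 c = new CRC32();
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Scanner reads old data, an append lands, then the scanner reads the new checksum. */
    static boolean racyScanLooksCorrupt() {
        Replica r = new Replica();
        r.append(bytes("a"));
        byte[] seen = r.readData();      // scanner sees the old data
        r.append(bytes("bc"));           // concurrent append updates the checksum
        long stored = r.readChecksum();  // scanner sees the new checksum
        return crc(seen) != stored;      // mismatch although the replica is healthy
    }

    /** Scanner takes data and checksum together, so a later append cannot split them. */
    static boolean consistentScanLooksCorrupt() {
        Replica r = new Replica();
        r.append(bytes("a"));
        Snapshot s = r.snapshot();
        r.append(bytes("bc"));
        return crc(s.data) != s.checksum;
    }

    public static void main(String[] args) {
        System.out.println("racy scan reports corruption: " + racyScanLooksCorrupt());
        System.out.println("consistent scan reports corruption: " + consistentScanLooksCorrupt());
    }
}
```

The append is placed between the two reads by hand so the outcome is deterministic; in a real DataNode the same ordering would arise only occasionally, from thread scheduling.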

Attachments

- HDFS-11160.001.patch (10 kB), 22/Nov/16 04:21, Wei-Chiu Chuang
- HDFS-11160.002.patch (11 kB), 22/Nov/16 17:37, Wei-Chiu Chuang
- HDFS-11160.003.patch (17 kB), 01/Dec/16 02:44, Yongjun Zhang
- HDFS-11160.004.patch (18 kB), 09/Dec/16 19:30, Wei-Chiu Chuang
- HDFS-11160.005.patch (18 kB), 10/Dec/16 18:21, Wei-Chiu Chuang
- HDFS-11160.006.patch (18 kB), 12/Dec/16 16:22, Wei-Chiu Chuang
- HDFS-11160.007.patch (17 kB), 14/Dec/16 22:20, Wei-Chiu Chuang
- HDFS-11160.008.patch (17 kB), 14/Dec/16 22:32, Wei-Chiu Chuang
- HDFS-11160.branch-2.patch (21 kB), 15/Dec/16 06:54, Wei-Chiu Chuang
- HDFS-11160.reproduce.patch (13 kB), 20/Nov/16 06:59, Wei-Chiu Chuang

Issue Links

breaks

- HDFS-12136 BlockSender performance regression due to volume scanner edge case (Resolved)

depends upon

- HDFS-11229 HDFS-11056 failed to close meta file (Resolved)

is depended upon by

- HDFS-11187 Optimize disk access for last partial chunk checksum of Finalized replica (Resolved)

is duplicated by

- HDFS-6804 Add test for race condition between transferring block and appending block causes "Unexpected checksum mismatch exception" (Resolved)

is related to

- HDFS-6804 Add test for race condition between transferring block and appending block causes "Unexpected checksum mismatch exception" (Resolved)
- HDFS-11354 TestBlockScanner#testAppendWhileScanning should shutdown the MiniDFSCluster (Patch Available)

relates to

- HDFS-11022 DataNode unable to remove corrupt block replica due to race condition (Open)
- HDFS-11229 HDFS-11056 failed to close meta file (Resolved)

People

Assignee: Wei-Chiu Chuang

Reporter: Wei-Chiu Chuang

Votes: 0

Watchers: 14

Dates

Created: 20/Nov/16 06:59

Updated: 02/Oct/19 17:14

Resolved: 16/Dec/16 21:43