Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- None
- None
- CDH5.7.4
- Reviewed
Fixed a race condition that caused VolumeScanner to flag a good replica as bad when the replica was being written to concurrently.
Description
Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously detect good replicas as corrupt. This is serious because in some cases it results in data loss if all replicas are declared corrupt. This bug is especially prominent when there are a lot of append requests via HttpFs/WebHDFS.
We are investigating an incident that caused a very high block corruption rate in a relatively small cluster. Initially, we thought HDFS-11056 was to blame. However, after applying the HDFS-11056 patch, we are still seeing VolumeScanner report corrupt replicas.
It turns out that if a replica is being appended to while VolumeScanner is scanning it, VolumeScanner may compare the new checksum against the old data, causing a checksum mismatch.
I have a unit test that reproduces the error and will attach it later. A quick and simple fix is to hold the FsDatasetImpl lock and read the checksum from disk.
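The race and the proposed fix can be illustrated with a toy model. This is a minimal sketch, not HDFS code: `ToyReplica`, `Snapshot`, and `ScanDemo` are hypothetical names, the `lock` field merely stands in for the FsDatasetImpl lock, the replica lives in memory rather than on disk, and the concurrent append is simulated by calling `append` between two reads instead of using real threads.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Toy in-memory replica: block data plus a checksum of that data. Hypothetical, not HDFS code. */
class ToyReplica {
    private final Object lock = new Object(); // stands in for the FsDatasetImpl lock
    private byte[] data = new byte[0];
    private long checksum = crc(data);

    /** Data and checksum captured at a single point in time. */
    static final class Snapshot {
        final byte[] data;
        final long checksum;
        Snapshot(byte[] data, long checksum) { this.data = data; this.checksum = checksum; }
        boolean isConsistent() { return crc(data) == checksum; }
    }

    static long crc(byte[] bytes) {
        CRC32 c = new CRC32();
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    /** Appends bytes and updates the checksum under the lock. */
    void append(byte[] more) {
        synchronized (lock) {
            byte[] grown = Arrays.copyOf(data, data.length + more.length);
            System.arraycopy(more, 0, grown, data.length, more.length);
            data = grown;
            checksum = crc(grown);
        }
    }

    byte[] readData() { synchronized (lock) { return data.clone(); } }

    long readChecksum() { synchronized (lock) { return checksum; } }

    /** Reads data and checksum together under one lock acquisition. */
    Snapshot snapshot() {
        synchronized (lock) { return new Snapshot(data.clone(), checksum); }
    }
}

class ScanDemo {
    private static byte[] bytes(String s) { return s.getBytes(StandardCharsets.UTF_8); }

    /** Racy scan: an append lands between reading the data and reading the checksum. */
    static boolean racyScanLooksHealthy() {
        ToyReplica replica = new ToyReplica();
        replica.append(bytes("old"));
        byte[] seen = replica.readData();      // scanner sees the old data
        replica.append(bytes("new"));          // simulated concurrent append
        long sum = replica.readChecksum();     // scanner sees the new checksum
        return ToyReplica.crc(seen) == sum;    // old data vs. new checksum
    }

    /** Locked scan: data and checksum are captured atomically before the append. */
    static boolean lockedScanLooksHealthy() {
        ToyReplica replica = new ToyReplica();
        replica.append(bytes("old"));
        ToyReplica.Snapshot snap = replica.snapshot();
        replica.append(bytes("new"));          // append no longer affects the snapshot
        return snap.isConsistent();
    }
}
```

In this model the racy scan pairs old data with the new checksum and reports a healthy replica as corrupt, while the locked scan always compares a matching pair. The cost of the real fix is holding a dataset-wide lock during a disk read, which is a trade-off the toy model does not capture.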
Attachments
Issue Links
- breaks
  - HDFS-12136 BlockSender performance regression due to volume scanner edge case (Resolved)
- depends upon
  - HDFS-11229 HDFS-11056 failed to close meta file (Resolved)
- is depended upon by
  - HDFS-11187 Optimize disk access for last partial chunk checksum of Finalized replica (Resolved)
- is duplicated by
  - HDFS-6804 Add test for race condition between transferring block and appending block causes "Unexpected checksum mismatch exception" (Resolved)
- is related to
  - HDFS-6804 Add test for race condition between transferring block and appending block causes "Unexpected checksum mismatch exception" (Resolved)
  - HDFS-11354 TestBlockScanner#testAppendWhileScanning should shutdown the MiniDFSCluster (Patch Available)
- relates to
  - HDFS-11022 DataNode unable to remove corrupt block replica due to race condition (Open)
  - HDFS-11229 HDFS-11056 failed to close meta file (Resolved)