Description
In our online cluster, the BlockPoolSliceScanner is turned off, and a bad disk track can sometimes cause data loss.
For example, suppose a block has 3 replicas on 3 machines A/B/C. If a bad track develops under A's replica, and some day B and C crash at the same time, the NN will try to replicate the block from A and fail. The block is now corrupt but no one knows, because the NN thinks there is still at least 1 healthy replica and keeps trying to replicate it.
When a replica that has data on a bad track is read, the OS returns an EIO error. If the DN reports the bad block as soon as it gets an EIO, we can find this case ASAP and try to avoid data loss.
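A minimal sketch of the idea, not the actual patch: when a block transfer fails, check whether the failure looks like an EIO and, if so, report the block right away. The names `ReplicaReader`, `BadBlockReporter`, `isEio` and `transferBlock` are hypothetical stand-ins, not Hadoop APIs. The sketch also assumes that on Linux the EIO surfaces in Java as an `IOException` whose message begins with "Input/output error"; that is platform dependent.

```java
import java.io.IOException;

public class EioReportSketch {

    /** Hypothetical stand-in for reading a replica from the local disk. */
    interface ReplicaReader {
        void read(String blockId) throws IOException;
    }

    /** Hypothetical stand-in for the DN-to-NN bad block report. */
    interface BadBlockReporter {
        void reportBadBlock(String blockId);
    }

    // Assumption: the OS-level EIO shows up as this message prefix (Linux).
    static final String EIO_MESSAGE = "Input/output error";

    /** Returns true if the exception looks like an EIO from the OS. */
    static boolean isEio(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.startsWith(EIO_MESSAGE);
    }

    /**
     * Tries to read the replica for a transfer. On an EIO the block is
     * reported immediately; other I/O failures are not reported here.
     * Returns true only if the read succeeded.
     */
    static boolean transferBlock(String blockId,
                                 ReplicaReader reader,
                                 BadBlockReporter reporter) {
        try {
            reader.read(blockId);
            return true;
        } catch (IOException e) {
            if (isEio(e)) {
                // Tell the NN now instead of silently retrying forever.
                reporter.reportBadBlock(blockId);
            }
            return false;
        }
    }
}
```

In the A/B/C scenario above, the failed transfer from A would trigger a report on the first attempt, so the NN would stop counting A's replica as healthy. Only EIO triggers a report in this sketch, so that transient failures such as a network timeout do not mark a good replica as bad.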
Attachments
Issue Links
- relates to
  - HDFS-14752 backport HDFS-13709 to branch-2 (Report bad block to NN when transfer block encounter EIO exception) (Patch Available)