Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 3.0.0-alpha2, 2.8.1
- None
- None
- Reviewed
Description
The following sequence of events can lead to a block under construction being considered missing.
- pipeline of 3 DNs, DN1->DN2->DN3
- DN3 has a failing disk so some updates take a long time
- Client writes entire block and is waiting for final ack
- DN1, DN2 and DN3 have all received the block
- DN1 is waiting for an ACK from DN2, which is waiting for an ACK from DN3
- DN3 is having trouble finalizing the block due to the failing drive. It does eventually succeed but it is VERY slow at doing so.
- DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
- DN3 finally sends an IBR to the NN indicating the block has been received.
- The drive containing the block on DN3 fails badly enough that the DN takes it offline and notifies the NN of the failed volume
- NN removes DN3's replica from the triplets and then declares the block missing because there are no other replicas
It seems like we shouldn't consider uncompleted blocks for replication.
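The sequence above can be replayed against a toy model of the NN's bookkeeping. This is only a rough sketch: `ToyBlockTracker`, its methods, and the `skipUncompleted` switch are hypothetical names invented for illustration and are not the HDFS `BlockManager` API, and the actual fix may be structured differently. It assumes the NN only knows about DN3's replica (via the IBR) and that the block is never completed because the client never received its final ack.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: every name here is hypothetical, not real HDFS code.
class ToyBlockTracker {
    enum State { UNDER_CONSTRUCTION, COMPLETE }

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Set<String>> replicas = new HashMap<>();
    private final Set<String> missing = new HashSet<>();
    // When true, uncompleted blocks are never flagged as missing.
    private final boolean skipUncompleted;

    ToyBlockTracker(boolean skipUncompleted) {
        this.skipUncompleted = skipUncompleted;
    }

    // A new block starts out under construction with no known replicas.
    void allocate(String blockId) {
        states.put(blockId, State.UNDER_CONSTRUCTION);
        replicas.put(blockId, new HashSet<>());
    }

    // Models an IBR: a DN reports that it has received the block.
    void blockReceived(String blockId, String dn) {
        replicas.get(blockId).add(dn);
    }

    // Models the client completing the block after the final ack.
    void complete(String blockId) {
        states.put(blockId, State.COMPLETE);
    }

    // Models the NN dropping a replica, e.g. after a failed-volume report.
    void removeReplica(String blockId, String dn) {
        Set<String> holders = replicas.get(blockId);
        holders.remove(dn);
        if (skipUncompleted && states.get(blockId) != State.COMPLETE) {
            return; // the block is still being written; do not judge it yet
        }
        if (holders.isEmpty()) {
            missing.add(blockId);
        }
    }

    boolean isMissing(String blockId) {
        return missing.contains(blockId);
    }

    // Replays the reported scenario and returns whether the block ends up missing.
    static boolean replayScenario(boolean skipUncompleted) {
        ToyBlockTracker nn = new ToyBlockTracker(skipUncompleted);
        nn.allocate("blk_1");
        // DN1 and DN2 tear down the pipeline without finalizing, so the NN
        // never hears about their replicas. Only DN3 eventually sends an IBR.
        nn.blockReceived("blk_1", "DN3");
        // DN3's drive is taken offline; the NN removes its only known replica.
        nn.removeReplica("blk_1", "DN3");
        return nn.isMissing("blk_1");
    }
}
```

In this model, `ToyBlockTracker.replayScenario(false)` reports the block as missing, while `replayScenario(true)` does not, which is the behavior the last sentence of the description argues for.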
Attachments
Issue Links
- breaks: HDFS-11818 TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently (Resolved)
- is related to: HDFS-11445 FSCK shows overall health status as corrupt even one replica is corrupt (Resolved)
- relates to: HDFS-12641 Backport HDFS-11755 into branch-2.7 to fix a regression in HDFS-11445 (Resolved)