Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
2.8.0
-
None
-
None
Description
The DN is extremely susceptible to a slow volume due bad locking practices. DN operations require a fs dataset lock. IO in the dataset lock should not be permissible as it leads to severe performance degradation and possibly (temporarily) missing blocks.
A slow disk will cause pipelines to experience significant latency and timeouts, increasing lock/io contention while cleaning up, leading to more timeouts, etc. Meanwhile, the actor service thread is interleaving multiple lock acquire/releases with xceivers. If many commands are issued, the node may be incorrectly declared as dead.
HDFS-12639 documents that both actors synchronize on the offer service lock while processing commands. A backlogged active actor will block the standby actor and cause it to go dead too.
Attachments
Issue Links
- is related to
-
HDFS-12639 BPOfferService lock may stall all service actors
- Open