Details
New Feature
Status: Resolved
Major
Resolution: Fixed
2.0.0, 3.0.0-alpha-2
None
Description
We all know that WAL sync performance directly affects RPC processing time.
We use the self-designed FanOutOneBlockAsyncDFSOutput to sync WAL entries; it connects directly to all the DNs that hold the block. But when even one of those DNs is slow, e.g. because of a disk hardware failure, WAL syncs become slow. What's more, the lower-layer HDFS is not very sensitive in detecting such hardware failures.
We can detect slow DNs from the ACK time of packets in FanOutOneBlockAsyncDFSOutput, and exclude them when adding new blocks after the log is rolled (a log roll can also be triggered by slow syncs). This info can also be shown in the UI. We can also invalidate the cached excluded DNs after a duration, so that we notice when those DNs have recovered.
I think this idea can quickly reduce the impact of slow DNs and improve service availability.
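To make the proposal concrete, here is a minimal sketch of the detection and expiry logic described above. It is only an illustration, not the HBase implementation: the class name SlowDataNodeTracker, its methods, and the three tunables (ACK latency threshold, number of slow ACKs before exclusion, exclusion duration) are all hypothetical, and a real patch would likely use richer criteria such as packet size.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of slow-DN tracking; all names and thresholds are assumptions. */
class SlowDataNodeTracker {
  private final long slowAckThresholdMs; // an ACK at or above this latency counts as slow
  private final int minSlowAcks;         // slow ACKs needed before a DN is excluded
  private final long excludeTtlMs;       // how long a DN stays excluded
  private final LongSupplier clockMs;    // injectable clock, eases testing

  private final Map<String, Integer> slowAckCounts = new ConcurrentHashMap<>();
  private final Map<String, Long> excludedSince = new ConcurrentHashMap<>();

  SlowDataNodeTracker(long slowAckThresholdMs, int minSlowAcks,
      long excludeTtlMs, LongSupplier clockMs) {
    this.slowAckThresholdMs = slowAckThresholdMs;
    this.minSlowAcks = minSlowAcks;
    this.excludeTtlMs = excludeTtlMs;
    this.clockMs = clockMs;
  }

  /** Record the ACK latency of one packet for the given DN. */
  void onAck(String dataNode, long ackLatencyMs) {
    if (ackLatencyMs < slowAckThresholdMs) {
      return; // fast ACK: nothing to record in this simplified version
    }
    int count = slowAckCounts.merge(dataNode, 1, Integer::sum);
    if (count >= minSlowAcks) {
      // Enough slow ACKs: exclude the DN and restart its counter.
      slowAckCounts.remove(dataNode);
      excludedSince.put(dataNode, clockMs.getAsLong());
    }
  }

  /** DNs to avoid for the next block; entries older than the TTL are dropped first. */
  Set<String> excludedDataNodes() {
    long now = clockMs.getAsLong();
    excludedSince.entrySet().removeIf(e -> now - e.getValue() >= excludeTtlMs);
    return new HashSet<>(excludedSince.keySet());
  }
}
```

In this sketch, the output stream would call onAck for each packet ACK, and whatever code creates the next WAL block after a log roll would pass excludedDataNodes() along as the nodes to avoid. Dropping entries after the TTL is what lets a recovered DN be used again.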