Details
- Bug
- Status: Resolved
- Normal
- Resolution: Fixed
- None
- Normal
Description
The problem I want to solve is that, in our deployment, a single slow but alive data node can slow down the whole cluster and even cause our requests to time out.
We are using DynamicEndpointSnitch with badness_threshold set to 0.1. I expect DynamicEndpointSnitch to switch to sortByProximityWithScore if the local data node's latency is too high.
I added some debug logging and found that, in many cases, the score for the remote data node was not populated, so the fallback to sortByProximityWithScore never happened. That is why a single slow data node can cause huge problems for the whole cluster.
In this JIRA, I'd like to use zero as the default score, so that we get a chance to try the remote data node when the local one is slow.
I tested this in our test cluster, and it significantly improved client latency in the single slow data node case.
I flagged this as a Bug because it has caused problems for our use cases multiple times.
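To illustrate why a missing score can suppress the fallback, here is a minimal, simplified sketch of a badness check. It is not the actual DynamicEndpointSnitch code: the class name BadnessCheckSketch, the methods collectScores and shouldResortByScore, the flag defaultMissingToZero, and the use of plain strings for endpoints are all assumptions made for illustration. The sketch assumes the check compares each score, in proximity order, against the score at the same position in the sorted list, scaled by (1 + badness threshold).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a badness check; not Cassandra's implementation.
class BadnessCheckSketch {

    // Collect scores in proximity order. A missing score is either skipped
    // (the behavior described in this report) or treated as 0.0 (the proposal).
    static List<Double> collectScores(List<String> proximityOrdered,
                                      Map<String, Double> scores,
                                      boolean defaultMissingToZero) {
        List<Double> collected = new ArrayList<>();
        for (String endpoint : proximityOrdered) {
            Double score = scores.get(endpoint);
            if (score == null) {
                if (!defaultMissingToZero) {
                    continue; // endpoint silently drops out of the comparison
                }
                score = 0.0;  // assumed default from the proposal
            }
            collected.add(score);
        }
        return collected;
    }

    // Returns true when some endpoint in proximity order scores worse than the
    // score at the same rank in the sorted list by more than the threshold.
    static boolean shouldResortByScore(List<String> proximityOrdered,
                                       Map<String, Double> scores,
                                       double badnessThreshold,
                                       boolean defaultMissingToZero) {
        List<Double> inProximityOrder =
                collectScores(proximityOrdered, scores, defaultMissingToZero);
        List<Double> sorted = new ArrayList<>(inProximityOrder);
        Collections.sort(sorted);

        Iterator<Double> best = sorted.iterator();
        for (Double score : inProximityOrder) {
            if (score > best.next() * (1.0 + badnessThreshold)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // ip1 is the slow local node with a score; ip2 has no score yet.
        List<String> order = Arrays.asList("ip1", "ip2");
        Map<String, Double> scores = Collections.singletonMap("ip1", 1.0);

        System.out.println(shouldResortByScore(order, scores, 0.1, false));
        System.out.println(shouldResortByScore(order, scores, 0.1, true));
    }
}
```

In this model, when only ip1 has a score, the single score is compared against itself and the threshold is never exceeded. With zero as the default, ip2 enters the comparison with 0.0, so ip1's score of 1.0 exceeds the best score and the check asks for a re-sort by score.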
==== logs ===
2018-02-21_23:08:57.54145 WARN 23:08:57 [RPC-Thread:978]: sortByProximityWithBadness: after sorting by proximity, addresses order change to [ip1, ip2], with scores [1.0]
2018-02-21_23:08:57.54319 WARN 23:08:57 [RPC-Thread:967]: sortByProximityWithBadness: after sorting by proximity, addresses order change to [ip1, ip2], with scores [0.0]
2018-02-21_23:08:57.55111 WARN 23:08:57 [RPC-Thread:453]: sortByProximityWithBadness: after sorting by proximity, addresses order change to [ip1, ip2], with scores [1.0]
2018-02-21_23:08:57.55687 WARN 23:08:57 [RPC-Thread:753]: sortByProximityWithBadness: after sorting by proximity, addresses order change to [ip1, ip2], with scores [1.0]
Attachments
Issue Links
- relates to: CASSANDRA-14555 Verify effect of CASSANDRA-14252 on streaming endpoint selection (Open)