Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.1
    • Component/s: Core
    • Labels:
      None

      Description

      There are many cases where getRangeSlice creates more
      RangeSliceCommands than it should, because it always creates one for each range
      returned by getRestrictedRanges. In particular, at CL.ONE this does not take
      the replication factor into account and is potentially quite wasteful.
      A range slice at CL.ONE on a 3-node cluster with RF=3 should only
      ever create one RangeSliceCommand.
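
      For illustration only (this sketch is not part of the original report), a toy model of the situation above: with RF=3 on a 3-node ring, every restricted range is replicated on every node, so a single common replica can serve the whole CL.ONE range slice. The node names and ring layout are hypothetical.

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Set;

      public class RangeSliceWasteDemo {
          public static void main(String[] args) {
              List<String> ring = Arrays.asList("node1", "node2", "node3"); // nodes in token order
              int rf = 3;

              // Replica set of range i: the node owning token i plus the next RF-1 nodes on the ring.
              List<Set<String>> replicasPerRange = new ArrayList<>();
              for (int i = 0; i < ring.size(); i++) {
                  Set<String> replicas = new HashSet<>();
                  for (int j = 0; j < rf; j++)
                      replicas.add(ring.get((i + j) % ring.size()));
                  replicasPerRange.add(replicas);
              }

              // At CL.ONE, any replica common to all ranges can serve them in one command.
              Set<String> common = new HashSet<>(replicasPerRange.get(0));
              for (Set<String> replicas : replicasPerRange)
                  common.retainAll(replicas);

              System.out.println("Restricted ranges: " + replicasPerRange.size()
                      + ", common replicas: " + common
                      + " -> one RangeSliceCommand would suffice at CL.ONE");
          }
      }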

      1. 0001-CASSANDRA-4858.patch
        7 kB
        Vijay
      2. 0001-CASSANDRA-4858-v2.patch
        7 kB
        Vijay
      3. 4858-cleanup.txt
        32 kB
        Jonathan Ellis
      4. 4858-v3-1.txt
        34 kB
        Sylvain Lebresne
      5. 4858-v3-2.txt
        3 kB
        Sylvain Lebresne
      6. 4858-v4.txt
        72 kB
        Sylvain Lebresne

        Issue Links

          Activity

          Jonathan Ellis added a comment -

          Created CASSANDRA-5354

          Jonathan Ellis added a comment -

          It looks like this broke CASSANDRA-833 again: blockFor doesn't take into account pendingEndpoints anymore.

          (Should probably create a new ticket to fix, but it doesn't look like my jira client can do that offline so this will serve as a reminder.)

          Sylvain Lebresne added a comment -

          You're right, we shouldn't merge if the right is the minimum token, to avoid creating wrapping ranges, since CFS.getRangeSlice doesn't know what to do with them. Committed a simple fix as 19972bdd91d6f8117eb8baa957b547000e995de3.

          Vijay added a comment -

          This patch causes

          ERROR [ReadStage:80] 2013-01-17 14:48:54,829 CassandraDaemon.java (line 133) Exception in thread Thread[ReadStage:80,5,main]
          java.lang.AssertionError: (max(0),max(0)]
          at org.apache.cassandra.db.ColumnFamilyStore.getSequentialIterator(ColumnFamilyStore.java:1352)
          at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1411)
          at org.apache.cassandra.service.RangeSliceVerbHandler.executeLocally(RangeSliceVerbHandler.java:46)
          at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1077)
          at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1562)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:680)
          
          

          Because the range merge merges Ranges like:

          [(max(0),max(88065836015670275036743470843252762100)], (max(88065836015670275036743470843252762100),min(-1)], (min(-1),max(0)]]
          

          to

          (max(0),max(0)]
          
          Sylvain Lebresne added a comment -

          Sorry, that was a stupid typo. Fixed in commit afed8ccddf2f5f32eebfa04a697da15fd3e07d2d.

          Brandon Williams added a comment -

          Reopening because this broke the dtests.

          Sylvain Lebresne added a comment -

          Committed, thanks

          Jonathan Ellis added a comment -

          +1

          nit: I only objected to "intersection" when it wasn't actually computing [just] an intersection; it is a fine name for the method now.

          Sylvain Lebresne added a comment -

          On the snitch heuristics, can we just say "if latency(b) > latency(a) + latency(c), then don't merge?"

          That's kind of what we want, though we want to do it on the final list of replicas that is going to be queried (for the first range, the second one, and their union). And because it is currently ReadCallback (created after we've made the decision of merging the ranges or not) that computes that final list of endpoints, I had been too lazy to do that properly. But that was a mistake, so I'm attaching a v4 that refactors things a little more (basically pulling the filtering of the final endpoints out of ReadCallback) to do this ticket properly.

          With that, the dynamic snitch heuristic ends up being something like:

          if (maxScore(endpointsFor(A and B)) > maxScore(endpointsFor(A)) + maxScore(endpointsFor(B)))
              // don't merge
          

          It's maxScore since we're dealing with lists of endpoints and it's the max latency that will define the total latency of the read.

          I note that this v4 includes Jonathan's cleanup. As commented in that cleanup patch, there is probably simplification to be gained by moving the Table inside ConsistencyLevel, but let's maybe do that in a follow-up ticket?
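
          A minimal sketch of that heuristic (not the attached patch; the score map and the method names are hypothetical stand-ins for what the dynamic snitch tracks):

          import java.util.Collection;
          import java.util.Map;

          final class MergeHeuristicSketch {
              // Hypothetical per-endpoint latency scores, as a dynamic snitch might track them.
              static double maxScore(Map<String, Double> scores, Collection<String> endpoints) {
                  // The slowest endpoint bounds the latency of a read that must touch all of them.
                  return endpoints.stream().mapToDouble(e -> scores.getOrDefault(e, 0.0)).max().orElse(0.0);
              }

              // Merge two ranges only if the merged query is not expected to be slower than
              // doing range A and range B separately.
              static boolean shouldMerge(Map<String, Double> scores,
                                         Collection<String> endpointsForA,
                                         Collection<String> endpointsForB,
                                         Collection<String> endpointsForMerged) {
                  return maxScore(scores, endpointsForMerged)
                          <= maxScore(scores, endpointsForA) + maxScore(scores, endpointsForB);
              }
          }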

          Jonathan Ellis added a comment -

          Main patch LGTM. Attaching a cleanup that renames intersection to mergeForRangeQuery, and changes a lot of "String table" parameters to actual Table objects.

          On the snitch heuristics, can we just say "if latency(b) > latency(a) + latency(c), then don't merge?"

          Vijay added a comment -

          +1 otherwise, this is better than what we have right now with vnodes. (We can also address a configurable threshold in a different ticket, IMHO.)

          Sylvain Lebresne added a comment -

          Can we use dynamic_snitch_badness_threshold or add something similar instead?

          I'm not sure reusing dynamic_snitch_badness_threshold is appropriate here. For instance, the default dynamic_snitch_badness_threshold of 0.1 feels like a fairly bad value for the concern here. And besides, dynamic_snitch_badness_threshold has its own use, which is separate. We can of course add a new setting here, though 1) it's somewhat pushing the burden onto the user and 2) if we create a setting, it should be a setting that makes sense, and I'm not sure exactly what that would be (in particular, I'm far from claiming the heuristic of my second patch is perfect, though it is possible it is "good enough").

          Vijay added a comment -

          'twice' here is a bit random and might not be appropriate

          Can we use dynamic_snitch_badness_threshold or add something similar instead?

          Sylvain Lebresne added a comment -

          but the scores have really helped in the past in returning queries within SLAs

          I agree, and the goal of the patch is certainly not to deny that. The question is, when you can merge 2 queries into 1, at which point do you decide it's not worth it based on the endpoint scores. I think it is safe to say that in most cases, picking 1 query over 2 will be the best option. That being said, if you do happen to have a very, very slow node that the intersection would pick, while doing 2 queries would only pick very fast nodes, then you might be worse off doing the 1 query. So along with a rebase of the previous patch I'm attaching a 2nd patch that adds a heuristic to the dynamic snitch that tries to avoid that. The heuristic is that if the average score of the intersection of the 2 replica sets is twice as bad (I'll admit that the 'twice' here is a bit random and might not be appropriate) as the average score of one of the replica sets, then we ignore the intersection. It's probably not the best heuristic, but it has the advantage of being simple.
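
          A minimal sketch of this average-score heuristic (not the attached patch; the comment leaves open which replica set to compare against, so this sketch uses the better of the two averages, and all names are hypothetical):

          import java.util.Collection;
          import java.util.Map;

          final class AverageScoreHeuristicSketch {
              static double avgScore(Map<String, Double> scores, Collection<String> endpoints) {
                  return endpoints.stream().mapToDouble(e -> scores.getOrDefault(e, 0.0)).average().orElse(0.0);
              }

              // Ignore the intersection (i.e. fall back to separate queries) if its average
              // score is more than twice that of the better replica set.
              static boolean useIntersection(Map<String, Double> scores,
                                             Collection<String> replicaSet1,
                                             Collection<String> replicaSet2,
                                             Collection<String> intersection) {
                  double baseline = Math.min(avgScore(scores, replicaSet1), avgScore(scores, replicaSet2));
                  return avgScore(scores, intersection) <= 2 * baseline;
              }
          }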

          Vijay added a comment -

          Refactor LGTM, but the scores have really helped in the past in returning queries within SLAs (maybe it is reasonable for RR, but I'm not sure).

          Sylvain Lebresne added a comment -

          I'm afraid the endpoint inclusion as done by v2 is not as efficient as it could be. Consider a 5-node, RF=3, no-DC setup queried at CL.ONE. As it happens, the first endpoint for any given range won't be in the list of endpoints for the next range. So we'll end up merging no ranges and doing 5 range queries, even though 2 would be enough to cover the whole range.

          So to minimize the number of ranges queried, I'm pretty sure the best option is, for a given range, to consider the intersection of its endpoints and the ones of the next range. I'm attaching a v3 patch that implements what I have in mind.

          I note that this v3 pulls the logic that computes whether a list of live endpoints can fulfill a given consistency level out of ReadCallback and WriteResponseHandler and into the ConsistencyLevel class. The reason is that the patch needs that logic before the ReadCallback has been created. But I think this is a good refactor, as this logic belongs in ConsistencyLevel anyway.

          This made me realise there is a complication, however, which is that we probably need to take datacenters and maybe even the endpoint latency scores into account. Say a range has replicas [A, B], the next range has replicas [B, C], and CL == ONE. You could merge both ranges and send the request to B, but if, say, B is in a remote datacenter while A and C are in the local one, maybe doing 2 queries to A and C would actually be better. Same if B is local but very, very slow. To try to handle that, the v3 patch moves that decision to the snitch, and the default implementation considers only endpoints in the local DC in the intersection of endpoints used to decide whether we can/should merge two consecutive ranges. We could then have the dynamic snitch do something special, like not considering endpoints with a very bad latency score when computing the intersection, but I haven't implemented that yet, because it's unclear to me where to draw the limit.

          I've done a few quick tests with this patch. For a 5-node, RF=3, no-DC setup, without the patch we query 5 ranges at CL.ONE and 10 at CL.QUORUM to cover the full ring (SELECT * FROM foo). With the patch, we query 2 ranges at CL.ONE and 6 at CL.QUORUM. And as expected, in the vnodes case with a single-node setup, the same SELECT * requires only 1 internal query instead of 256.
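
          A rough sketch of the merge decision described above (not the v3 code; the DC lookup and the single "required" count are simplifications, since real consistency-level checks can be per-DC, and all names are hypothetical):

          import java.util.HashSet;
          import java.util.List;
          import java.util.Map;
          import java.util.Set;

          final class LocalDcMergeSketch {
              // Intersect the replica lists of two consecutive ranges, keeping only local-DC
              // endpoints, so merging never trades two local queries for one remote query.
              static Set<String> localIntersection(List<String> replicasOfRange,
                                                   List<String> replicasOfNextRange,
                                                   Map<String, String> dcByEndpoint,
                                                   String localDc) {
                  Set<String> intersection = new HashSet<>(replicasOfRange);
                  intersection.retainAll(replicasOfNextRange);
                  intersection.removeIf(e -> !localDc.equals(dcByEndpoint.get(e)));
                  return intersection;
              }

              // Merge only if the local-DC intersection can still satisfy the consistency level
              // (modelled here as a simple endpoint count).
              static boolean canMerge(List<String> replicasOfRange,
                                      List<String> replicasOfNextRange,
                                      Map<String, String> dcByEndpoint,
                                      String localDc,
                                      int requiredByCl) {
                  return localIntersection(replicasOfRange, replicasOfNextRange, dcByEndpoint, localDc).size()
                          >= requiredByCl;
              }
          }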

          Vijay added a comment -

          To be honest I didn't read the pseudocode before starting the patch (my bad).

          range1's handler.endpoints (the live endpoints, but after filtering for RR and CL) are compared against range2's live nodes. If range1's handler.endpoints are in range2's live nodes, then we don't need to spawn additional requests; this should cover the RR and CL cases explained above. Similar to the pseudocode.

          Sylvain Lebresne added a comment -

          Is there a reason why this is limited to CL.ONE? The problem is not necessarily limited to CL.ONE. For instance, in the degenerate case where RF == number of nodes, only one command ever needs to be created, even for CL.ALL. More generally, as in the snippet of pseudo-code above, I think that as long as the intersection of the endpoints for the current range and the endpoints for the next range is large enough to satisfy the consistency level, we can merge both ranges into one command. I'm also not sure why we would limit this to when RR is not true, as long as the condition I've just described holds.

          As a side note, there is a typo in IncludingExcludingBounds.cloneWithRight (right is used twice).

          Vijay added a comment -

          A simple and somewhat hacky patch:

          We check that the CL is ONE and that RR is not triggered (via FBU.threadLocalRandom().nextDouble()); if so, we use this optimization.

          Sylvain Lebresne added a comment -

          It is true that for a given subpart of the ring, you'll in general need more nodes to cover it with vnodes than without. But I think that's just one downside of vnodes and there is little we can do about it, and more importantly, I don't think we should focus on that in this ticket.

          What we're trying to fix here is that when a node is the primary replica for a range r, it is also a replica for RF-1 other ranges contiguous to r. That's true with vnodes or not. So to cover (at CL.ONE) the range composed of those RF contiguous ranges, we only need one RangeSliceCommand, but currently we create RF of them. In other words, we currently create RF times too many RangeSliceCommands because we don't take the replication factor into account.

          And as it happens, in the degenerate case where RF == number of nodes, this is even worse than that, because in that case 1 RangeSliceCommand is always enough for CL.ONE, but we create RF * nb-tokens-per-node of them, and that's where 'vnodes make this a lot worse'. But in general this ticket is not really related to vnodes (yes, range queries potentially suck more with vnodes, but that's a completely different problem, though in practice I'm not sure it really matters).

          Vijay added a comment -

          The problem is that we have to scan the nodes in token order so we don't break the existing APIs; doing so means we send a lot more requests, and wait for a lot more responses, than there are nodes.

          It's highly unlikely we will be able to query contiguous ranges from the same node. Still thinking of a better way...

          Jonathan Ellis added a comment -

          vnodes make this a lot worse.

          Jonathan Ellis added a comment -

          Sylvain says, "I don't think there is anything to win in doing something more complex than (in very pseudo code):"

            ranges = new ArrayQueue(getRestrictedRanges(command.range));
            while (!ranges.isEmpty())
            {
                range = ranges.poll();
                endpoints = endpointsFor(range);
              while (intersection(endpoints, endpointsFor(ranges.peek())) >= requiredByCL)
                {
                    nextRange = ranges.poll();
                    range = union(range, nextRange);
                    endpoints = intersection(endpoints, endpointsFor(nextRange));
                }
                send range to endpoints
            }
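
          A direct translation of that pseudocode into plain Java, offered only as a sketch (not the actual StorageProxy code; the Topology helpers getRestrictedRanges, endpointsFor, union and send are hypothetical stand-ins):

          import java.util.ArrayDeque;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Queue;
          import java.util.Set;

          final class MergingRangeQuerySketch<R> {
              interface Topology<T> {
                  List<T> getRestrictedRanges(T commandRange); // split the queried range at ring boundaries
                  Set<String> endpointsFor(T range);           // live replicas for a range
                  T union(T a, T b);                           // concatenate two contiguous ranges
                  void send(T range, Set<String> endpoints);   // issue one request for the (merged) range
              }

              void execute(Topology<R> topology, R commandRange, int requiredByCl) {
                  Queue<R> ranges = new ArrayDeque<>(topology.getRestrictedRanges(commandRange));
                  while (!ranges.isEmpty()) {
                      R range = ranges.poll();
                      Set<String> endpoints = new HashSet<>(topology.endpointsFor(range));
                      // Greedily absorb following ranges while enough common replicas remain for the CL.
                      while (!ranges.isEmpty()
                             && intersection(endpoints, topology.endpointsFor(ranges.peek())).size() >= requiredByCl) {
                          R nextRange = ranges.poll();
                          range = topology.union(range, nextRange);
                          endpoints = intersection(endpoints, topology.endpointsFor(nextRange));
                      }
                      topology.send(range, endpoints);
                  }
              }

              private static Set<String> intersection(Set<String> a, Set<String> b) {
                  Set<String> result = new HashSet<>(a);
                  result.retainAll(b);
                  return result;
              }
          }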
          

            People

            • Assignee: Sylvain Lebresne
            • Reporter: Jonathan Ellis
            • Reviewer: Vijay
            • Votes: 0
            • Watchers: 5
