CASSANDRA-2495

Add a proper retry mechanism for counters in case of failed request

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None

      Description

      Contrary to standard inserts, counter increments are not idempotent. As such, replaying a counter mutation when a TimeoutException occurs could lead to an over-count. This alone limits the use cases for which counters are a viable solution, so we should try to come up with a mechanism that allows the replay of a failed counter mutation without the risk of over-count.
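
      As a minimal illustration of the problem (the CounterClient interface and its increment method below are hypothetical placeholders, not an existing client API), a TimeoutException does not tell the caller whether the increment was applied, so the obvious retry can apply the same delta twice:

      import java.util.concurrent.TimeoutException;

      // Hypothetical client API used only for illustration; not an existing driver interface.
      interface CounterClient {
          void increment(String key, String column, long delta) throws TimeoutException;
      }

      class NaiveCounterRetry {
          // Unsafe: a TimeoutException only means no acknowledgement arrived in time.
          // The first attempt may still have been applied by the replicas, so the
          // retry below can apply the same delta twice -- the over-count this
          // ticket is about.
          static void incrementWithRetry(CounterClient client, String key, String column, long delta)
                  throws TimeoutException {
              try {
                  client.increment(key, column, delta);
              } catch (TimeoutException e) {
                  client.increment(key, column, delta); // may double-count
              }
          }
      }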

      1. marker_idea.txt
        4 kB
        Sylvain Lebresne

        Activity

        Arya Goudarzi added a comment -

        I opened another ticket for reworking the counters backend, as you recommended: CASSANDRA-4775

        Jonathan Ellis added a comment -

        I don't think it can be done without a complete reworking of the counters backend. And if that happens it will be a separate ticket, so closing this one.

        George Courtsunis added a comment -

        Any update? Would really love to have this feature in place.

        T Jake Luciani added a comment -

        CASSANDRA-2034 does improve this a bit, since we know a hint was stored for the timed out replica(s).

        Yang Yang added a comment -

        Interesting topic; some thoughts that popped into my mind:

        this ultimately requires that you keep a history of all counter updates, instead of just keeping the current counter value;
        the marker for each add is essentially this history.

        Then the issue is: how long do you keep the history? For use cases that do really frequent adds of small deltas, the history can be huge.

        You can limit the number of adds that require a history/marker to be kept for future verification if you simply don't keep a marker when a counter update completes successfully; also, for each future update on the same counter, we only write a marker if the update is specifically declared to be a retry attempt. There is no need to guarantee that every pair of updates with the same uuidClient gets de-duplicated; the client should bear that responsibility.
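
        A rough sketch of what the client side of this idea could look like (all of the types and fields below, CounterUpdate, CounterService, clientId, isRetry, are invented for illustration and exist nowhere in the code base): the first attempt carries the client uuid but is not flagged, and only a declared retry asks the server to check/keep a marker for that uuid.

        import java.util.UUID;
        import java.util.concurrent.TimeoutException;

        // Illustrative types only; not part of Cassandra's code base.
        class CounterUpdate {
            final String key;
            final String column;
            final long delta;
            final UUID clientId;   // the "uuidClient" from the comment above
            final boolean isRetry; // only retries ask the server to check/keep a marker

            CounterUpdate(String key, String column, long delta, UUID clientId, boolean isRetry) {
                this.key = key;
                this.column = column;
                this.delta = delta;
                this.clientId = clientId;
                this.isRetry = isRetry;
            }
        }

        interface CounterService {
            void apply(CounterUpdate update) throws TimeoutException;
        }

        class ClientSideRetry {
            // The first attempt is a plain update; if it times out, the retry reuses
            // the same clientId and is explicitly declared as a retry, so only then
            // does the server need a marker to de-duplicate against.
            static void increment(CounterService service, String key, String column, long delta)
                    throws TimeoutException {
                UUID clientId = UUID.randomUUID();
                try {
                    service.apply(new CounterUpdate(key, column, delta, clientId, false));
                } catch (TimeoutException e) {
                    service.apply(new CounterUpdate(key, column, delta, clientId, true));
                }
            }
        }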

        Sylvain Lebresne added a comment -

        Here's my initial thought on that:

        First, I think that whatever we come up with will have intrinsic limitations. What we want is to be able to retry a failed increment, with the guarantee that the retry will only be applied if the initial increment was not. By failed increment, we mean here one of:

        • the client got a TimeoutException back
        • the coordinator died and the client got some broken pipe error
        • a bug made the coordinator return a TApplicationException("some unexpected shit happened")

        When that happens, different things can have happened. One possible scenario is that the first replica (let's call it A) received the increment and did persist it on disk, but then failed before having replicated it. In that case we end up in a situation where, until A is brought back up, we cannot decide whether a retry should be discarded or actually applied, because we cannot know whether A died just before persisting the increment or just after.

        Which leads me to think that whatever idea we have for this will likely have one of the two following drawbacks:

        1. either retries will be limited to CL.ALL (fairly useless in my opinion),
        2. or we accept the retry at any CL, but have a way to eventually detect when both the initial increment and its retry have been applied, and a way to repair when that happens. This quite probably implies that we will have over-counts, but with the guarantee that they will eventually be repaired (a toy sketch of this shape follows just below).
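
        As a toy, single-node illustration of the shape of option 2 (not of the attached marker_idea.txt, and with all names invented for the example), a counter could log which operation ids it has applied and later subtract any duplicate applications:

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.UUID;

        // Toy, single-node illustration only; the real problem is distributed and is
        // what marker_idea.txt / CASSANDRA-1546 tried to address.
        class RepairableCounter {
            private long value = 0;
            // History of applied increments, keyed by a client-supplied operation id.
            private final Map<UUID, List<Long>> applied = new HashMap<>();

            // Both the initial increment and its retry may end up here.
            void apply(UUID operationId, long delta) {
                value += delta;
                applied.computeIfAbsent(operationId, id -> new ArrayList<>()).add(delta);
            }

            // Eventual repair: if the same operation id was applied more than once,
            // subtract the extra applications. The value is temporarily over-counted
            // but converges once this runs.
            void repair() {
                for (List<Long> deltas : applied.values()) {
                    for (int i = 1; i < deltas.size(); i++) {
                        value -= deltas.get(i);
                    }
                    if (deltas.size() > 1) {
                        deltas.subList(1, deltas.size()).clear(); // keep only the first application
                    }
                }
            }

            long value() {
                return value;
            }
        }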

        Of course, there can be better solutions that I don't see.

        Anyway, I had tried to implement the second idea (the eventual repair) back in the day on CASSANDRA-1546. In particular, I'm attaching to this issue the txt file (marker_idea) from there that was supposed to explain how this should work. The code in CASSANDRA-1546 is also supposed to implement this idea, so more details on the specifics can be found there if the text file is not clear enough. Unfortunately, when I thought about porting this idea to the current code, I realized that it had corner cases it wasn't handling well: in some situations the complete death of a node was problematic, and I haven't found a good solution so far. So the whole idea may or may not be a good starting point.


          People

          • Assignee: Unassigned
          • Reporter: Sylvain Lebresne
          • Votes: 13
          • Watchers: 25
