Here's my initial thought on that:
First, I think that whatever we come up with will have intrinsic limitations. What we want is to be able to retry a failed increment, with the guarantee that the retry will only be applied if the initial increment was not. By "failed increment", we mean here any of:
- the client got a TimeoutException back
- the coordinator died and the client got some broken pipe error
- a bug made the coordinator return a TApplicationException("something unexpected happened")
When that happens, several different things may actually have happened. One possible scenario is that the first replica (let's call it A) received the increment and persisted it on disk, but then failed before replicating it. If that happens, we end up in a situation where, until A is brought back up, we cannot decide whether a retry should be discarded or applied, because we cannot know whether A died just before persisting the increment or just after.
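To illustrate the ambiguity, here is a small toy simulation (purely hypothetical, not actual Cassandra code): the replica may crash either before or after persisting the delta, and in both cases the client observes the same timeout, so it cannot tell whether retrying would double-count.

```python
# Toy model: a replica that crashes during an increment, either before
# or after persisting the delta. The client sees the same failure in
# both cases.

class Replica:
    def __init__(self):
        self.persisted = 0

    def increment(self, delta, crash_before_persist):
        if crash_before_persist:
            raise TimeoutError("replica died before persisting")
        self.persisted += delta
        # Crash after persisting but before replicating/acknowledging.
        raise TimeoutError("replica died after persisting")

def client_increment(replica, delta, crash_before_persist):
    try:
        replica.increment(delta, crash_before_persist)
    except TimeoutError as e:
        return str(e)  # a timeout is all the client ever learns

a1, a2 = Replica(), Replica()
client_increment(a1, 1, crash_before_persist=True)
client_increment(a2, 1, crash_before_persist=False)
# Same client-visible failure, but different on-disk state:
print(a1.persisted, a2.persisted)  # 0 1
```

The two runs are indistinguishable from the client side, which is exactly why a blind retry is unsafe.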
Which leads me to think that whatever idea we come up with will likely have one of the two following drawbacks:
- either the retry will be limited to CL.ALL (fairly useless in my opinion),
- or we accept the retry at any CL, but have a way to eventually detect when both the initial increment and its retry have been applied, and a way to repair the count when that happens. This quite probably implies that we will have over-counts, but with the guarantee that they will eventually be repaired.
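As a rough sketch of the second option (again hypothetical, not the actual CASSANDRA-1546 code): if every increment carries a client-generated id, a retry reuses the same id, and a later repair pass can collapse duplicate ids so the over-count is eventually corrected.

```python
# Sketch: increments tagged with a client-generated UUID so that a
# repair pass can detect that both an increment and its retry were
# applied, and collapse them to a single application.

import uuid

log = []  # (increment_id, delta) pairs as applied by replicas

def apply_increment(increment_id, delta):
    log.append((increment_id, delta))

def raw_value():
    # What reads would return before any repair runs.
    return sum(delta for _, delta in log)

def repaired_value():
    # Repair pass: entries sharing an id collapse to one application.
    seen = {}
    for inc_id, delta in log:
        seen[inc_id] = delta
    return sum(seen.values())

inc = uuid.uuid4()
apply_increment(inc, 1)  # the initial increment was applied after all...
apply_increment(inc, 1)  # ...and so was the retry
print(raw_value())       # 2  (temporary over-count)
print(repaired_value())  # 1  (what repair converges to)
```

The cost is the window during which reads see the over-count, plus having to keep the per-increment ids around long enough for repair to run.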
Of course, there can be better solutions that I don't see.
Anyway, I had tried to implement the second idea (the eventual repair) back in the day on
CASSANDRA-1546. In particular, I'm attaching to this issue the txt file (marker_idea) from there that was supposed to explain how this should work. The code in CASSANDRA-1546 is also supposed to implement this idea, so more details on the specifics can be found there if the text file is not clear enough. Unfortunately, when I thought about porting this idea to the current code, I realized that it had corner cases it wasn't handling well: in some situations the complete death of a node was problematic, and I haven't found a good solution so far. So the whole idea may or may not be a good starting point.