Nicolas Favre-Felix, regarding the comments on the gist:
Having "merge cells" means that we could support both TTLs and "put" operations on counters, as long as the semantics are well defined.
Can do "put" writes - true, this would become possible (and trivial - you just write a new merge cell). I don't see how this design would allow for TTLs in a sane way, though (or even insane ones).
… or would we always require QUORUM writes? This could be too restrictive in multi-DC deployments where most people probably prefer to read and write at LOCAL_QUORUM. … When this set is common to all replicas (as is proposed above), we can only merge in QUORUM reads if we can guarantee QUORUM writes or must merge in reads at ALL otherwise.
Yep, this is (too) restrictive, and it's a problem. Both options are too restrictive - we can't require QUORUM writes (esp. in multi-DC clusters), and requiring ALL for merges is also unreasonable, multi-DC or not. This alone makes the (initial) design in the gist unsuitable as-is, IMO.
Now, regarding the first improvement suggestion (sharded replicas):
if we assign a replica deterministically (based on the operation's UUID for example) we risk not being able to write to the counter if that particular replica goes down
And this is unacceptable.
if we assign a replica at random, retrying would lead to duplicates
While not a total deal-breaker, it still weakens the argument for the set-based design somewhat (enabling external idempotence/client retries is one of the justifications for the extra overhead).
Also, while set collapse becomes a local problem now, what we can actually merge/collapse is just a subset of all updates, not all of them. The read logic (and the cleanup during compaction, and the collapse logic itself) becomes more complicated. Despite there being a logical 'owner', the increments (including the merge cells) still have to be replicated (for RF > 1), so each replica now holds increments it 'owns' and increments it doesn't. That means you can no longer read until the first merge cell you meet and stop there, because that merge cell only covers a subset of all writes. You have to keep going until you've read the whole row, or a merge cell for every shard there is, and accumulation itself becomes harder: for every increment you have to track whether you've already seen a shadowing merge cell for the subset it belongs to, and either add it or ignore it accordingly.
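To illustrate how much the accumulation changes, here is a rough sketch of the read-time summation this would require (all names are hypothetical, this is not actual Cassandra code): you can only stop early once every shard has contributed a merge cell, and every delta has to be checked against the set of shards already shadowed.

```java
import java.util.*;

// Hypothetical sketch of reading a counter in the sharded-replicas design,
// where a merge cell only shadows the deltas of its own shard.
final class ShardedCounterRead
{
    record Cell(UUID shard, long value, boolean isMerge) {}

    // cells must be iterated newest-to-oldest
    static long sum(List<Cell> newestFirst, int shardCount)
    {
        long total = 0;
        Set<UUID> mergedShards = new HashSet<>();
        for (Cell cell : newestFirst)
        {
            if (cell.isMerge())
            {
                if (mergedShards.add(cell.shard()))
                    total += cell.value();          // folds in all older deltas of this shard
                // an older merge cell of an already-merged shard is shadowed; skip it
            }
            else if (!mergedShards.contains(cell.shard()))
            {
                total += cell.value();              // un-shadowed delta, must be counted
            }

            if (mergedShards.size() == shardCount)
                break;                              // only now is it safe to stop early
        }
        return total;
    }
}
```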
It becomes possible to read and write at LOCAL_QUORUM using this scheme. As any particular replica is the only source of truth for the subset of deltas that it was assigned, it does by definition read ALL of its deltas and can sum them up with no risk of inconsistency.
You still need the write-path special-casing of the current counters - that is, you have to send the request to a single replica first, to determine the name of the resulting cell (uuid + owner), and have it replicate the cell to the other replicas (which kills internal retry safety, by the way). Alternatively, if you determine the owner replica up front, generate the cell name yourself, and write at LOCAL_QUORUM, but the write to that owner replica actually fails, then you can't merge locally - not even the owner's own subset - and you'd need at least LOCAL_QUORUM reads.
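Roughly, the special-cased write path would have to look something like this (purely illustrative, all names hypothetical): the coordinator can't name the cell until it has picked the owner, and the owner's failure leaves its subset un-mergeable locally.

```java
import java.net.InetAddress;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of the write-path special-casing this scheme keeps.
final class ShardedCounterWritePath
{
    record CellName(InetAddress owner, UUID updateId) {}

    // The cell name embeds the owning replica, so the coordinator has to pick
    // the owner before it can even construct the mutation it will replicate.
    static CellName nameFor(List<InetAddress> liveReplicas)
    {
        InetAddress owner = liveReplicas.get(0); // e.g. first live replica; a deterministic pick risks unavailability
        return new CellName(owner, UUID.randomUUID());
    }
    // The named cell is then written to the owner and replicated to the rest;
    // if the write to the owner fails while others succeed, the owner can no
    // longer merge even its own subset from local data alone.
}
```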
So you can't do client retries safely, can't do internal retries safely, have to keep the special write path, and have reads that are pretty much as complex as the current implementation, with the added bonus of extra space and time overhead.
True, you can now do merges at LOCAL_ALL (which we don't have, but that's not an issue) or at LOCAL_QUORUM, if you mandate writes at LOCAL_QUORUM. (If the writes are at regular QUORUM or anything other than LOCAL_QUORUM/ALL, then you still need LOCAL_ALL for the merges, because QUORUM doesn't guarantee LOCAL_QUORUM).
A configuration flag per counter CF would configure whether we require W.QUORUM+R.QUORUM (default) or let clients write with any CL with the downside that deltas can only be merged at CL.ALL.
As I wrote above, unless all the writes are at ALL or LOCAL_QUORUM, you can't merge strictly DC-locally. So it's actually a choice between [W.LOCAL_QUORUM writes + R.LOCAL_QUORUM merges] or [W.WHATEVER writes + R.LOCAL_ALL merges]. And I'm not a fan of either, even though it's definitely better than requiring ALL.
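To put some (purely illustrative) numbers on that: with two DCs at RF=3 each, QUORUM is 4 acks out of 6 and can be satisfied by 3 replicas in DC1 plus 1 in DC2; a subsequent LOCAL_QUORUM read in DC2 (2 of 3) can then land entirely on the two replicas that never saw the delta, so a DC-local merge there would silently drop it - hence the LOCAL_ALL requirement.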
You also keep the complexity of the sharded design above - you can only merge the subset of the updates that belongs to the DC, and thus still have the complicated read/compaction logic. Oh, and safe idempotent client retries would require drivers to special-case counter retries somehow, which means they'd have to look inside the query to determine that it's a counter update and not follow the configured retry strategy.
I do suspect that both of these designs are also sensitive to topology changes. Oh, and with both of them you can't do 'read-repair-merge' at most read consistency levels, and when you can, those merges are also complicated by having multiple subsets per counter.
I believe that the space and time overheads are about the same as in Aleksey's design.
Kinda, you now have to encode the owning node (in design #2) or the owning DC (in design #3) along with the update id, or take away some bytes from it.
Now, there is another issue with the design in the gist and the two suggested improvements. We only have two mechanisms to trigger the collapse/merge - doing it during reads (and only QUORUM reads for design #1, and at limited consistency levels with #2 and #3) and building candidate lists during compaction. I'm concerned about hot counters that generate tons of updates but are rarely read, or are read while the updates are still within counter_write_window. Read latency becomes unpredictable with a set-based design. With a particularly hot counter it could even be possible to OOM during the reads. Writes would be faster than with the current implementation, true, but the unpredictable read latency bothers me a lot.
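To make that concrete with made-up but plausible numbers: a counter taking ~10,000 increments/s with a five-minute counter_write_window can accumulate on the order of 3,000,000 un-collapsible delta cells, every one of which a single read has to pull in and sum.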
I could also write up why Jonathan's original suggestion wouldn't work either - "after gc_grace_seconds, rounded up to the nearest hour" does not magically remove the need to synchronize during the collapse, and implementing a "preliminary merge" CF is far from trivial.
To summarize, I don't think that set-based designs are viable. They can be made either simple, but with unreasonably strict limitations, or nearly as complex as the current implementation, with limitations that are still too strict, plus unpredictable read latency and storage overhead added on top.
We should look into potential improvements to the current implementation instead:
1. See if we can simplify it by switching to RMW without killing the speed of current counters (maybe it could be done with some intelligent caching or another optimization not yet suggested);
2. Maybe switch to having two increment-only counters (G-Counters) for each shard, one for increments and one for decrements, if that proves beneficial.
(If we implement both 1) and 2) then we'd have exactly the PN-Counter implementation described in the CRDT whitepaper, btw).
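For reference, a minimal PN-Counter along the lines of the CRDT paper would look roughly like this (illustrative only, not a proposal for the actual storage format): two grow-only maps per counter, value is sum(P) - sum(N), and merging two replicas' states is a per-shard max, which is safe because each replica only ever bumps its own shard's entries.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal PN-Counter sketch: one G-Counter for increments, one for decrements.
final class PNCounter
{
    private final Map<String, Long> p = new HashMap<>(); // increments, per shard
    private final Map<String, Long> n = new HashMap<>(); // decrements, per shard

    // a replica only ever updates its own shard's entries
    void increment(String shard, long by) { p.merge(shard, by, Long::sum); }
    void decrement(String shard, long by) { n.merge(shard, by, Long::sum); }

    long value()
    {
        long total = 0;
        for (long v : p.values()) total += v;
        for (long v : n.values()) total -= v;
        return total;
    }

    // merge is commutative, associative and idempotent: per-shard max
    void merge(PNCounter other)
    {
        other.p.forEach((shard, v) -> p.merge(shard, v, Math::max));
        other.n.forEach((shard, v) -> n.merge(shard, v, Math::max));
    }
}
```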