At its core, this patch uses the same main idea as CASSANDRA-1072, but
instead of using a Clock, it splits the subparts of the context of
CASSANDRA-1072 into individual columns. This has, at the very least, two
advantages over the clock structure:
- it is much simpler in terms of code, as it reuses a lot of what already
  exists. In particular, it avoids the fairly complex byte array manipulation
  of the CASSANDRA-1072 context.
- each part of the counter (that is, each (host id, count) pair) has a
  timestamp. This means we get full counters (increment and decrement) for
  free (that alone is worth it, if you ask me).
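As a toy illustration of this layout (the class and names below are mine, not the patch's), a counter can be modeled as a map from host id to a (count, timestamp) shard: the counter's value is the sum over shards, and reconciling two replicas' views keeps, per host, the shard with the newest timestamp — which is exactly what having a timestamp on each part buys:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of a counter stored as one column per replica:
// each "shard" is a (count, timestamp) pair keyed by host id.
class CounterSketch {
    static final class Shard {
        final long count;
        final long timestamp;
        Shard(long count, long timestamp) { this.count = count; this.timestamp = timestamp; }
    }

    private final Map<String, Shard> shards = new HashMap<>();

    // A replica applies a delta by bumping its own shard only.
    void applyLocal(String hostId, long delta, long timestamp) {
        Shard prev = shards.get(hostId);
        long base = prev == null ? 0 : prev.count;
        shards.put(hostId, new Shard(base + delta, timestamp));
    }

    // Reconciliation: per host id, the shard with the newer timestamp wins.
    void merge(CounterSketch other) {
        other.shards.forEach((host, s) -> shards.merge(host, s,
            (mine, theirs) -> mine.timestamp >= theirs.timestamp ? mine : theirs));
    }

    // Read path: the value is the sum over all replica shards,
    // so decrements (negative deltas) come for free.
    long value() {
        return shards.values().stream().mapToLong(s -> s.count).sum();
    }
}
```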
More anecdotally, this allows for a different approach to the 'cleanContext'
logic than the one in CASSANDRA-1072, which I believe is faster and cleaner
(and hopefully correct). This is more of an implementation detail however, so
we can discuss it if someone finds a problem.
The patch also includes some modifications of the patch I had once attached to
CASSANDRA-580 and CASSANDRA-1397. It adds a new verb handler used when
submitting a counter update, which first forwards the update to an actual
replica of the counter. This replica applies the update and then replicates
it. So this patch does support the usual consistency levels (for reads and
writes). This involves a local read on the first replica (called the leader in
the patch), but at least the patch ensures that at CL.ONE, this read happens
after the write is acked to the client.
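The CL.ONE ordering described above can be sketched roughly like this (a deliberately simplified in-memory model with invented names; the real patch works with mutations and verb handlers, not objects): the leader applies the delta, the client is acked, and only then do the local read and the replication to the other replicas happen:

```java
import java.util.List;

// Sketch of the leader-based write path: the coordinator does not apply
// the increment itself; it forwards it to one replica (the "leader").
class LeaderWriteSketch {
    static final class Replica {
        long count;                          // this replica's local view of its shard
        void apply(long delta) { count += delta; }
    }

    // At CL.ONE: ack to the client once the leader has applied the delta;
    // the local read + replication happen only after that ack.
    static void writeAtOne(long delta, Replica leader, List<Replica> others,
                           Runnable ackToClient) {
        leader.apply(delta);                 // leader applies the update first
        ackToClient.run();                   // write is acked to the client
        long leaderShard = leader.count;     // the "local read" on the leader
        for (Replica r : others)
            r.count = leaderShard;           // replicate the leader's full shard, not the delta
    }
}
```

Replicating the leader's full shard value (rather than the raw delta) is what makes the replication idempotent between replicas in this toy model.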
In this patch, a counter is thus a (small) bunch of columns (one per replica),
and hence a counter is internally a row. It could have been a super column,
but it's not, for reasons that follow. But let me add that when (in 0.8 maybe?)
we have super columns that we don't need to deserialize fully, it will make a
lot of sense to change counters to be super columns instead.
CASSANDRA-1072 has a fairly important problem: it is a bit too
unreliable. If you do a write and for some reason (TimeoutException or loss of
connection to the coordinator) it doesn't succeed, you have NO WAY to know
whether the increment has been counted or not. Nothing really new, standard
writes have the same problem, but with standard writes you just replay the
update. With CASSANDRA-1072, you can't, because that could mean incrementing
the value twice... or not. Given that TimeoutExceptions really do happen, this
is far from perfect, even though I perfectly understand that some can live
with it.
In this patch, I propose an improvement that, while maybe not perfect, improves
this situation a lot. The idea is to tag each counter update with a unique id
(typically a uuid, even though the implementation doesn't assume it). If an
update times out, you simply replay it with the same uuid and the system will
make sure it doesn't get applied twice. To do that, when a new update for a
counter is applied, a marker column with the provided uuid is inserted in
the same row (remember, counters are rows). But before even applying the
update, we try to read the marker. If we find one, we just skip the update;
otherwise we apply it (and insert the marker). The details are more
complicated, in particular because we don't know if the same 'leader' will be
chosen for the replay of the update, but I'll let interested people look at
the implementation for the details. Those marker columns are expiring columns,
so they stay only for some time (gcGrace in the patch, though that can
change). The TTL you choose here is the window of time during which you can
replay a failed update (CASSANDRA-1537 will make a lot of sense for those
marker columns).
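The replay-marker idea can be sketched as follows (again, the class and the in-memory map are mine for illustration; in the patch the marker is an actual expiring column living in the counter's row):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of idempotent counter updates via expiring marker "columns".
class MarkerSketch {
    private long count;
    // marker uuid -> expiration time; stands in for an expiring column's TTL
    private final Map<String, Long> markers = new HashMap<>();

    // Returns true if the update was applied, false if it was a skipped replay.
    boolean applyUpdate(String uuid, long delta, long now, long ttl) {
        if (uuid != null) {
            Long expiry = markers.get(uuid);
            if (expiry != null && expiry > now)
                return false;              // marker found: replay is a no-op
            markers.put(uuid, now + ttl);  // insert the expiring marker
        }
        count += delta;                    // apply the increment
        return true;
    }

    long value() { return count; }
}
```

Note how the uuid is optional, mirroring the patch: without one the update simply isn't replay-safe, and once a marker expires, a replay of that update would count again — hence the TTL defines the safe replay window.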
Note that those markers are only used on writes. The read path only reads the
columns that hold the counts. So read performance is as good as it gets.
Let's sum up: this means 2 reads on the write path, one for the marker and one
before replication. BUT:
- the marker is totally optional. If you don't provide one, you will not be
able to replay a failed update, but if you can live with that, you can.
- all those columns (for a given counter) being in the same row, you can
  assign a fair amount of row cache and make those reads fairly painless.
Compared to CASSANDRA-1072, this means that if you don't use uuids
and you write at CL.ONE, you should get virtually the same write performance
with this patch.
All this means that all operations on counters go through new thrift API calls.
The patch is split in 3 parts. The first one removes the clock structure (it
is CASSANDRA-1502, in fact). The second part is the actual patch. The third is
the thrift-generated file changes (attached for convenience's sake only;
nothing worth reading there).
Now the less good news. This patch is really just out of the oven (written
over the course of the week-end). It is clearly under-tested. Actually, the
extent of my testing is the system tests included in the patch. That is: no
tests on multiple hosts and no stress tests yet. I will work on those, but I
felt the patch was in good enough shape that it was worth sharing, and that it
was a good time to get feedback.
The patch also has a few known problems:
- Even though I include a system test for it, remove is broken. In some
  cases, where an update issued after a delete gets in before the delete, the
  delete can be wrongly discarded. For what it's worth, I'm pretty sure
  CASSANDRA-1072 suffers from the exact same problem. I have to think
  more about it, but any idea is welcome.
- There are no hints for the CounterMutation (cf. the patch). This makes the
  marker idea more fragile than it should be. Those should be added at some
  point. It's also worth noting that the marker idea is not bulletproof and
  will probably never be. But it's my strong belief that it's much better
  than nothing.
- It could be that streaming is broken with this (for the same reason that
  CASSANDRA-1072 has code for it). It could be that it's not. I'm not
  hugely familiar with this part of the code. I have to check, and fix it if
  needed.
- Because it's not the coordinator but the first replica (the leader) that
  does the replication, we may have to send UnavailableException back to the
  coordinator (and then back to the client). It's almost done, but not quite
  finished.
I will try to address all of the above asap. But I will have very little
time to give to this during the week, so any help (in tests and reviews in
particular) will be appreciated. All opinions, remarks, and criticisms are
also welcome.