only the coordinator of a given batch might be able to replay batches
Right, my (so far unspecified) assumption was that we would provide a way to assume responsibility for a removed coordinator's orphaned entries on removetoken/decommission/replacetoken.
when a node A detects that another node B is down, it could check whether it has some batches for B locally and replay them
That makes sense; that would be a lot more timely than waiting to replace B. If we have B delete these entries when it's done, we might not even need to worry about removetoken et al.
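To make that concrete, here's a rough sketch of what the replay-on-failure-detection piece could look like. All of the type and method names here (BatchlogStore, onNodeDown, etc.) are made up for illustration; they are not actual Cassandra APIs.

```java
import java.net.InetAddress;
import java.util.List;

// Hypothetical types for illustration only; not actual Cassandra APIs.
interface BatchlogEntry {}

interface BatchlogStore
{
    List<BatchlogEntry> entriesFor(InetAddress coordinator);
    void delete(BatchlogEntry entry);
}

public class BatchlogReplayer
{
    private final BatchlogStore store;

    public BatchlogReplayer(BatchlogStore store)
    {
        this.store = store;
    }

    // Invoked when the failure detector marks a peer down.
    public void onNodeDown(InetAddress deadCoordinator)
    {
        // If we hold backlog entries written on behalf of the dead coordinator,
        // replay them to the data replicas ourselves instead of waiting for
        // removetoken/decommission/replacetoken to reassign them.
        for (BatchlogEntry entry : store.entriesFor(deadCoordinator))
        {
            replayToReplicas(entry);
            store.delete(entry); // drop the entry once replay succeeds
        }
    }

    private void replayToReplicas(BatchlogEntry entry)
    {
        // Would push the batched mutations through the normal write path.
    }
}
```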
I find 1 just a bit too low for a default
Well, it's more complex than that. If we used the BacklogStrategy proposed above, it's really "RF=1+": to lose data we need both (1) the coordinator to go down before it can replicate out to the actual data replicas, and (2) the backlog shard host to die unrecoverably. So the window in which hardware failure can cause data loss is much narrower than in a traditional RF=1 case, where losing that node at any time from now on loses data. That traditional case needs at least RF=2 for redundancy, since there is no alternative; in the backlog case, both the coordinator and the client provide redundancy across a small window of vulnerability.
That said, I think if we just use the normal StorageProxy (SP) read path for replay purposes, we get arbitrary RF support automatically, so I don't think using 1 as a default and allowing it to be tuned as desired will be a problem.
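As a sketch of what I mean (the read-path interface below is a stand-in, not the real StorageProxy signature), replay would just issue an ordinary coordinated read against the backlog data, so whatever RF the operator configures is honored without any replay-specific replication code:

```java
import java.util.List;

// Stand-in types; the real code would go through the existing coordinated read path.
interface ReadCommand {}
interface Row {}

interface ReadPath
{
    // The normal coordinated read already handles any RF and consistency level.
    List<Row> read(List<ReadCommand> commands, String consistencyLevel);
}

class BacklogReplay
{
    private final ReadPath readPath;

    BacklogReplay(ReadPath readPath)
    {
        this.readPath = readPath;
    }

    List<Row> fetchOrphanedEntries(List<ReadCommand> backlogCommands)
    {
        // No special-case replication logic: RF=1 by default, tunable upward.
        return readPath.read(backlogCommands, "ONE");
    }
}
```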
I fully expect the retry policy for clients to be unchanged
I think this is (a) an important improvement, (b) one addressing a significant pain point among the people who have looked closely enough to realize what the "official" policy is today, and (c) one that we can fix without much difficulty in the context of atomic batches. We use the local commitlog for both atomicity and durability; we can use the distributed batchlog in the same way. (I note in passing that Megastore uses Bigtable rows as a distributed transaction log in a similar fashion.)
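Roughly, the coordinator write path would have the same log/apply/purge shape that the local commitlog has today. The names below are illustrative only, not a proposed API:

```java
import java.util.List;

// Hypothetical mutation/endpoint types for illustration.
interface Mutation {}
interface Endpoint {}

interface DistributedBatchlog
{
    // Durably record the whole batch on another node before applying it.
    long append(Endpoint backlogTarget, List<Mutation> batch);
    void remove(Endpoint backlogTarget, long batchId);
}

class BatchWritePath
{
    private final DistributedBatchlog batchlog;

    BatchWritePath(DistributedBatchlog batchlog)
    {
        this.batchlog = batchlog;
    }

    // Same shape as the local commitlog: log first, apply, then purge the log entry.
    void applyAtomically(Endpoint backlogTarget, List<Mutation> batch)
    {
        long batchId = batchlog.append(backlogTarget, batch); // durability + atomicity point
        applyToReplicas(batch);                               // normal replication to data replicas
        batchlog.remove(backlogTarget, batchId);              // safe to forget once fully applied
    }

    private void applyToReplicas(List<Mutation> batch)
    {
        // Would hand each mutation to the regular write path.
    }
}
```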
"Failed writes leave the cluster in an unknown state" is the most frequent [legitimate] complaint users have about Cassandra, and one that affects evaluations vs master-oriented systems. We can try to educate about the difference between UE failure and TOE not-really-failure until we are blue in the face but we will continue to get hammered for it.
The standard "just retry the operation" answer isn't really adequate, either. If part of a batch times out, and the client then dies before a retry succeeds, we have no way to recover from the inconsistency (in the ACID sense, not the CAP sense).
Hence Hector has implemented a client-side commitlog, which helps, but this is neither something every client should have to reimplement, nor as durable as what we can provide on the server (since it's effectively always RF=1), nor do we expect client machines to be as robust as the ones in the Cassandra cluster.
Now, we cannot eliminate TOE 100% of the time, but we can come very, very close. With the approach I outlined, the only case where we need to hand back a TOE is when the coordinator attempts a backlog write to a believed-to-be-up node, the write fails, and the coordinator is then partitioned off so that no other live backlog targets are available. Since we cannot continue, and we don't know whether the original attempt succeeded, we have to return TOE. So we will (1) dramatically reduce TOE, and (2) ensure that the TOE we do hand back does not cause inconsistency if the client dies before it can retry.
I understand the argument that (2) is the really important part, but again, we can deliver (1) without significantly more effort, so I think it's worth doing. It's the difference between "you should still implement client-side retry after each op in case of failure" and "you can probably ignore this the way you do today with the chance of your Oracle installation failing before it gives you the answer."
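To spell out that remaining TOE case, here is a sketch of the coordinator-side control flow (again with made-up names): we only surface a TOE after every live backlog target has been tried and failed, at which point we cannot know whether any attempt actually landed.

```java
import java.util.List;

// Hypothetical types; the point is the control flow, not the API.
interface Endpoint {}
interface Batch {}

class TimedOutException extends Exception {}

class BacklogWriter
{
    // Try each believed-to-be-up backlog target in turn; only if every attempt
    // fails (e.g. the coordinator has been partitioned off) do we hand back a TOE.
    void writeBacklog(List<Endpoint> liveBacklogTargets, Batch batch) throws TimedOutException
    {
        for (Endpoint target : liveBacklogTargets)
        {
            if (tryWrite(target, batch))
                return; // backlogged successfully; no TOE needed
        }
        throw new TimedOutException();
    }

    private boolean tryWrite(Endpoint target, Batch batch)
    {
        // Would send the batch to the target and wait for an ack.
        return false;
    }
}
```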