Brandon's first patch, which fixes reads at CL.ALL, turns out to address the only actual bug. The rest is obscure-but-valid behavior that occurs when tombstones expire before they have been replicated across the cluster (i.e., the tombstones exist on some nodes, but not all). Let me give an example:
Say node A has columns x and y, where x is an expired tombstone with timestamp T1, and node B has a live column x with timestamp T2, where T2 < T1.
If you read at ALL you will see x from B and y from A. You will not see x from A: since the tombstone is expired, it is no longer relevant off-node. Thus, the ALL read will send a repair of column x to A, since it was "missing."
But the next time you read from A, the tombstone will still suppress the newly written copy of x from B, because the tombstone's timestamp is higher. So the replicas won't converge.
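To make the example concrete, here is a rough Python sketch of the scenario. It is a toy model, not Cassandra's actual read path: the Cell, Node, reconcile, and read_at_all names are invented for illustration, the timestamps 20 and 10 stand in for T1 and T2, and I am assuming B also holds y so that x is the only column in dispute.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Cell:
    timestamp: int
    value: Optional[str] = None   # None marks a tombstone
    expired: bool = False         # tombstone is older than GCGraceSeconds


def reconcile(a: Optional[Cell], b: Optional[Cell]) -> Optional[Cell]:
    """Keep whichever cell has the higher timestamp."""
    if a is None:
        return b
    if b is None:
        return a
    return b if b.timestamp > a.timestamp else a


class Node:
    """Hypothetical replica: one cell per column name."""

    def __init__(self, cells: Dict[str, Cell]):
        self.cells = dict(cells)

    def write(self, name: str, cell: Cell) -> None:
        # An incoming write (or repair) only wins if it is newer.
        self.cells[name] = reconcile(self.cells.get(name), cell)

    def read(self) -> Dict[str, Cell]:
        # Expired tombstones still shadow older data locally,
        # but they are not sent back to the coordinator.
        return {name: cell for name, cell in self.cells.items()
                if not (cell.value is None and cell.expired)}


def read_at_all(nodes: List[Node]) -> Tuple[Dict[str, Cell], int]:
    """Merge every replica's response and repair replicas that differ."""
    responses = [node.read() for node in nodes]
    merged: Dict[str, Cell] = {}
    for response in responses:
        for name, cell in response.items():
            merged[name] = reconcile(merged.get(name), cell)
    repairs = 0
    for node, response in zip(nodes, responses):
        for name, cell in merged.items():
            if response.get(name) != cell:
                node.write(name, cell)
                repairs += 1
    return merged, repairs


def build_scenario() -> Tuple[Node, Node]:
    t1, t2 = 20, 10  # stand-ins for T1 and T2, with T2 < T1
    y = Cell(timestamp=5, value="y-value")
    node_a = Node({"x": Cell(timestamp=t1, expired=True), "y": y})
    node_b = Node({"x": Cell(timestamp=t2, value="x-value"), "y": y})
    return node_a, node_b


if __name__ == "__main__":
    node_a, node_b = build_scenario()
    for attempt in range(3):
        merged, repairs = read_at_all([node_a, node_b])
        print(attempt, sorted(merged), "repairs sent:", repairs)
```

In this model every ALL read returns x and y and sends a repair of x to A, and A's tombstone discards that repair each time, so the repair count never drops to zero.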
This is not a bug, because the design explicitly allows this behavior when tombstones expire before being propagated to all nodes; see http://wiki.apache.org/cassandra/DistributedDeletes. The best way to avoid it, of course, is to run repair frequently enough to ensure that tombstones are propagated within GCGraceSeconds of being written.
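As a back-of-the-envelope check on "frequently enough", something like the following Python sketch may help. The function names are made up for illustration, the 864000-second (10 day) value is only the usual default and your cluster may differ, and the safety rule (interval plus repair duration must fit inside GCGraceSeconds) is my simplification rather than an exact guarantee.

```python
# Assumed value: the usual default of 10 days. Check your own configuration.
DEFAULT_GC_GRACE_SECONDS = 864000


def tombstone_is_purgeable(deletion_time: int, now: int,
                           gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """A tombstone may be dropped by compaction once it is older than the grace period."""
    return now - deletion_time > gc_grace_seconds


def repair_schedule_is_safe(repair_interval_seconds: int,
                            repair_duration_seconds: int,
                            gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """Worst case: a tombstone is written just after a repair starts, so it must
    wait one full interval and then for the next repair to finish."""
    return repair_interval_seconds + repair_duration_seconds < gc_grace_seconds


DAY = 24 * 60 * 60

if __name__ == "__main__":
    # Weekly repair that takes about a day, with a 10-day grace period.
    print(repair_schedule_is_safe(7 * DAY, 1 * DAY))
    # Repair every 10 days leaves no margin at all.
    print(repair_schedule_is_safe(10 * DAY, 1 * DAY))
```

With these assumptions, a weekly repair that takes a day fits inside a 10-day grace period, while a repair every 10 days does not.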
But if you do find yourself in this situation, you have two options to get things to converge again:
1) The simplest option is to perform a major compaction on each node, which will eliminate all expired tombstones.
2) If you want to propagate as many of the tombstones as possible first, increase your GCGraceSeconds setting everywhere (requires a rolling restart) and perform a full repair as described in http://wiki.apache.org/cassandra/Operations. After the repair is complete, you can put GCGraceSeconds back to what it was.