It seems that, unless I'm missing something, either is possible with the current release code, and thus with these patches as well.
Technically correct, but in practice we're in pretty good shape. The sequence is:
1. Add the changing node to pending ranges
2. Sleep for RING_DELAY so everyone else starts including the new target in their writes
3. Flush the data to be transferred
4. Send over the data for writes that happened before (1)
Step 1 happens on every coordinator; steps 2-4 happen only on the node that is giving up a token range.
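Roughly, in code, the sequence looks like this (a sketch only; addPendingRange/flush/stream are illustrative names, not the actual StorageService/TokenMetadata API):
{code}
import java.net.InetAddress;

// Illustrative sketch of steps 1-4; method names are hypothetical.
abstract class RangeRelocationSketch
{
    static final long RING_DELAY = 30_000; // ms; 30s by default

    void relocateRange(String range, InetAddress newOwner) throws InterruptedException
    {
        // (1) announce the pending range; every coordinator picks this up
        //     and starts including newOwner as a write target
        addPendingRange(range, newOwner);

        // (2) wait out RING_DELAY so the announcement reaches everyone
        Thread.sleep(RING_DELAY);

        // (3) flush, so writes that completed before (1) are on disk
        flush(range);

        // (4) stream the flushed data to the new owner
        stream(range, newOwner);
    }

    abstract void addPendingRange(String range, InetAddress newOwner);
    abstract void flush(String range);
    abstract void stream(String range, InetAddress newOwner);
}
{code}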
The guarantee we need is that any write that happens before the pending range change (PRC) completes before the subsequent flush.
Even if we used TM.lock to protect the entire ARS sequence (guaranteeing that no local write is in progress once the PRC happens), we could still receive writes from other nodes that began their PRC later.
So we rely on the RING_DELAY (30s) sleep. I suppose a GC pause at just the wrong time, for instance, could theoretically mean a mutation against the old ring state gets sent out late, but I don't see how we can improve on that.
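To make that window concrete, here's a hypothetical coordinator-side sketch (endpointsFor and send are made-up names):
{code}
import java.net.InetAddress;
import java.util.List;

// Hypothetical illustration of the race window; all names are made up.
abstract class StaleRingWriteSketch
{
    void write(Object mutation, long token)
    {
        // Ring snapshot taken before the pending-range change:
        // the new target is not among these endpoints.
        List<InetAddress> targets = endpointsFor(token);

        // <-- If a GC pause here outlasts the pending-range change
        //     plus the entire RING_DELAY sleep, the flush in step (3)
        //     runs before this mutation reaches the old replicas,
        //     so the streamed data never includes it.

        for (InetAddress target : targets)
            send(mutation, target);
    }

    abstract List<InetAddress> endpointsFor(long token);
    abstract void send(Object mutation, InetAddress target);
}
{code}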
IMHO, to be defensive, any time the write lock is acquired in TokenMetadata, the version should be bumped in the finally block before the lock is released.
I haven't thought this through as much. What are you saying we should bump that we weren't already calling invalidate on?
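If I'm reading the suggestion right, the pattern would be something like this sketch (field and method names are illustrative, not the actual TokenMetadata internals):
{code}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the proposed "bump in finally" pattern; illustrative names.
class VersionedTokenMetadataSketch
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final AtomicLong version = new AtomicLong();

    void mutateRing(Runnable change)
    {
        lock.writeLock().lock();
        try
        {
            change.run();
        }
        finally
        {
            // Defensive: bump the version on every write-lock release,
            // so anything keyed on it (e.g. the ARS endpoint cache) is
            // invalidated even if a mutation path forgets to do so.
            version.incrementAndGet();
            lock.writeLock().unlock();
        }
    }

    long getVersion()
    {
        return version.get();
    }
}
{code}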
Is the idea with the striped lock on the endpoint cache in AbstractReplicationStrategy to help smooth out the stampede effect when the "global" lock on the cached TM gets released after the fill?
I'm trying to avoid a minor stampede on calculateNaturalEndpoints (CASSANDRA-3881), but it's probably premature optimization. v5 attached w/o that.
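For the record, the striping I had in mind was roughly this (a sketch of the dropped approach; names and structure are approximate, not the actual ARS code):
{code}
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the striped endpoint cache idea dropped from v5.
abstract class StripedEndpointCacheSketch<K>
{
    private final ConcurrentHashMap<K, List<String>> cache = new ConcurrentHashMap<>();
    private final Object[] stripes = new Object[16];

    StripedEndpointCacheSketch()
    {
        for (int i = 0; i < stripes.length; i++)
            stripes[i] = new Object();
    }

    List<String> getNaturalEndpoints(K token)
    {
        List<String> endpoints = cache.get(token);
        if (endpoints != null)
            return endpoints;

        // Stripe by token: after an invalidation, readers of different
        // tokens refill under different locks instead of piling onto one
        // global lock, and readers of the same token compute only once.
        synchronized (stripes[Math.floorMod(token.hashCode(), stripes.length)])
        {
            endpoints = cache.get(token); // re-check under the stripe
            if (endpoints == null)
            {
                endpoints = calculateNaturalEndpoints(token);
                cache.put(token, endpoints);
            }
            return endpoints;
        }
    }

    abstract List<String> calculateNaturalEndpoints(K token);
}
{code}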