So, I've been mulling this over, and I think we can safely guarantee eventual consistency even without a write-path repair. The question is only timeliness; however, timeliness can be problematic with or without write-path repair, since the DELETEs and UPDATEs for the same operation are not replicated in a coordinated fashion.
Timeliness / Consistency Caveats
If, for instance, three different updates are sent at once, and each base-replica receives a different update, each may be propagated onto two further nodes via repair, giving 9 nodes with live data. Eventually each base-replica receives one more of the original updates, and issues deletes for its now-shadowed data. So we have 6 nodes with live data and 3 with tombstones. The batchlog is now happy; however, the system is in an inconsistent state until either the MV replicas are repaired again or - with write-path-repair - the base-replicas are.
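The counts in that scenario can be tallied with a small sketch (the replication factor and update names are assumptions for illustration, not anything from the actual design):

```python
# Toy tally of the scenario above: three competing updates to one base row,
# each landing first on a different base replica and each producing a
# different MV row.

RF = 3                                   # assumed view replication factor
live = {u: 1 for u in ("a", "b", "c")}   # each base replica writes its MV row

# Repair propagates each MV row to the other two view replicas.
for u in live:
    live[u] += RF - 1
assert sum(live.values()) == 9           # 9 nodes with live data

# Each base replica then receives one more of the original updates, shadows
# its earlier write, and issues a delete to its paired view replica.
tombstones = 0
for u in live:
    live[u] -= 1
    tombstones += 1

assert sum(live.values()) == 6           # 6 nodes still with live data
assert tombstones == 3                   # 3 with tombstones
```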
Without write-path repair we may be more prone to this problem, but I don't think dramatically more, although I haven't had the time to think it through exhaustively. AFAICT, repair must necessarily be able to introduce inconsistency that can only be resolved by another repair (which itself can, of course, introduce more inconsistency).
I'm pretty sure there is a multiplicity of similar scenarios, and certainly there are less extreme ones. Two competing updates and one repair are enough, so long as it's the "wrong" update that gets repaired, and the "wrong" MV replica it gets repaired to.
There are also other problems unique to MVs: if we lose any two base-replicas (which with vnodes means any two nodes), we can be left with ghost records that are never purged. So any concurrent loss of two nodes means we really need to truncate the MV and rebuild, or we need a special truncation record to truncate only the portion that was owned by those two nodes. This happens for any updates that were received only by those two nodes, but were then proxied on (or written to their batchlogs). This can of course affect any normal QUORUM updates to the cluster, the difference being that the user has no control over it here, and simply resending the update to the cluster does not resolve the problem as it would in the current world. Users performing updates to single partitions that would never have been affected by this now also have it to worry about.
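A minimal model of the two-node-loss hazard (all node and row names here are invented for illustration):

```python
# An update reaches only two of the three base replicas, which proxy it on
# to view replicas (or write it to their batchlogs) before both are lost.

base_replicas = {"n1", "n2", "n3"}
saw_update = {"n1", "n2"}           # the only base replicas with the update
mv_rows = {"view-row-x"}            # already proxied to the view replicas

lost = {"n1", "n2"}                 # concurrent loss of two nodes
survivors = saw_update - lost

# No surviving base replica knows the update ever existed, so nothing will
# ever generate the tombstone that purges the view row: a permanent ghost.
ghost_records = mv_rows if not survivors else set()
assert ghost_records == {"view-row-x"}
```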
Certainly, I think we need to do a bit of formal analysis or simulation of the possible cluster states. Ideally a simple model of how each piece of infrastructure works could be constructed in a single process, to run the equivalent of years of operations a normal cluster would execute and explore the levels of "badness" we can expect. That's just my opinion, but I think it would be invaluable, because after spending some spare time thinking about these problems, I think this is a very hard thing to get right, and I would rather not trust our feelings about correctness.
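To give a flavour of the kind of single-process simulation I mean, here is a toy sketch. It does not model MVs at all - just quorum writes, quorum reads, and pairwise newest-wins repair, with every parameter invented - but the shape (deterministic seed, many steps, count the "badness") is the point:

```python
import random

random.seed(0)
NODES, RF, QUORUM = 5, 3, 2
STEPS = 10_000

state = {n: (None, -1) for n in range(NODES)}   # node -> (value, timestamp)
latest_acked = (None, -1)
stale_reads = 0

for t in range(STEPS):
    if random.random() < 0.9:
        # Client write: pick RF replicas, but assume only a quorum of them
        # actually applies it (the rest are down / hints are lost).
        replicas = random.sample(range(NODES), RF)
        for n in replicas[:QUORUM]:
            state[n] = ("v%d" % t, t)
        latest_acked = ("v%d" % t, t)
    else:
        # Repair between two random nodes: newest timestamp wins on both.
        a, b = random.sample(range(NODES), 2)
        newest = max(state[a], state[b], key=lambda vt: vt[1])
        state[a] = state[b] = newest
    # Quorum read from random nodes; stale if it misses the latest write.
    read = max((state[n] for n in random.sample(range(NODES), QUORUM)),
               key=lambda vt: vt[1])
    if read[1] < latest_acked[1]:
        stale_reads += 1

print("stale quorum reads: %d / %d" % (stale_reads, STEPS))
```

A real version would model base replicas, paired view replicas, batchlogs, and repair separately, then report how long and how widely inconsistent states persist.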
As far as multiple columns are concerned: I think we may need to go back to the drawing board there. It's actually really easy to demonstrate the cluster getting into broken states. Say you have three columns, A, B and C, and you send three competing updates a, b and c to their respective columns; previously all held the value _. If they arrive in different orders on each base-replica we can end up with 6 different MV states around the cluster. If any base-replica dies, you don't know which of those 6 intermediate states were taken (and probably replicated) by its MV replicas. This problem grows exponentially as you add "competing" updates (which, given split brain, can compete over arbitrarily long intervals).
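The count of 6 is easy to check by enumeration (the column and update names here just follow the example above):

```python
from itertools import permutations

# Columns A, B, C start as "_"; competing updates a, b, c each set one
# column. Enumerate every arrival order and collect each intermediate state
# a base replica passes through (and which its MV replicas may observe).

updates = {"A": "a", "B": "b", "C": "c"}
intermediate = set()
for order in permutations(updates):
    state = {"A": "_", "B": "_", "C": "_"}
    for col in order[:-1]:          # states before the final update lands
        state[col] = updates[col]
        intermediate.add(tuple(state[c] for c in ("A", "B", "C")))

print(len(intermediate))  # 6 distinct intermediate states
```

Three states with one column set, three with two set: six intermediate states, any of which a dying base-replica may have left behind in the view.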
This is where my concern about a "single (base) node" dependency comes in, but after consideration it's clear that with a single column this problem is avoided because it's never ambiguous what the old state was. If you encounter a mutation that is shadowed by your current data, you can always issue a delete for the correct prior state. With multiple columns that is no longer possible.
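A sketch of why the single-column case stays unambiguous (the function and its shape are hypothetical, purely to illustrate the argument):

```python
# For a single-column view, the base replica's current value IS the only
# possible prior view key, so any base update can always be paired with a
# delete for the correct old view row - even a shadowed one.

def apply_base_update(base_row, new_value, new_ts):
    """Return (view_key_to_delete, view_key_to_insert) for a 1-column view."""
    old_value, old_ts = base_row
    if new_ts <= old_ts:
        # Mutation is shadowed by current data: we still know exactly which
        # view row it would have created, so we can issue that delete.
        return (new_value, None)
    # Not shadowed: the old view row is keyed by the one old value.
    return (old_value, new_value)

assert apply_base_update(("x", 10), "y", 20) == ("x", "y")  # normal update
assert apply_base_update(("x", 10), "z", 5) == ("z", None)  # shadowed update
```

With multiple columns, the "old value" depends on which of the competing updates have already been applied, so no such unambiguous delete exists.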
I'm pretty sure the presence of multiple columns introduces other issues with each of the other moving parts.
Important Implementation Detail
Had a quick browse, and I found that writeOrder is now wrapping multiple synchronous network operations. OpOrders are only intended to wrap local operations, so this should be rejigged to avoid locking up the cluster.
Anyway, hopefully that wall of text isn't too unappetizing, and is somewhat helpful. To summarize my thoughts, I think the following things are worthy of due consideration and potentially highlighting to the user:
- RF-1 node failures cause parts of the MV to receive NO updates, and remain incapable of responding correctly to any queries on their portion (this will apply unevenly for a given base update, meaning multiple generations of value could persist concurrently in the cluster)
- Any node loss results in a significantly larger hit to the consistency of the MV (in my example, 20% loss of QUORUM to cluster resulted in 57% loss to MV)
- Both of these are potentially avoidable by ensuring we try another node if "ours" is down, but due consideration needs to be given to whether this results in more cluster inconsistencies
- Repair seems necessarily to risk introducing inconsistent cluster states that can only be fixed by another repair (which may itself introduce more such states), resulting in potentially lengthy inconsistencies, or a repair frequency greater than can operationally be managed right now
- Loss of any two nodes in a vnode cluster can result in permanent inconsistency
- Have we spotted all of the caveats?
- Rejig writeOrder usage
- Multiple columns need a lot more thought