We've observed this when upgrading from 2.1.15 to 3.0.8 and from 2.1.16 to 3.0.10: some lightweight transactions executed on upgraded nodes fail with a read failure. The following conditions seem relevant:
- The transaction must be conditioned on the current value of at least one column; IF NOT EXISTS transactions, for example, don't seem to be affected (see the example after this list).
- There should be a collection column (in our case, a map) defined on the table on which the transaction is executed.
- The transaction should be executed before sstables on the node are upgraded. The failure does not occur after the sstables have been upgraded (whether via nodetool upgradesstables or effectively via compaction).
- Upgraded nodes seem to be able to participate in lightweight transactions as long as they're not the coordinator.
- The values in the row being manipulated by the transaction must have been consistently manipulated by lightweight transactions (perhaps the existence of Paxos state for the partition is somehow relevant?).
- In 3.0.10, it seems to be necessary to have the partition split across multiple legacy sstables. This was not necessary to reproduce the bug in 3.0.8 or 3.0.9.
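For illustration only (hypothetical keyspace, table, and column names, not our actual schema), a transaction of the first form below matches the conditions above, while the IF NOT EXISTS form does not seem to trigger the failure:

    -- Hypothetical single-node setup and table with a map column.
    CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    CREATE TABLE test.t (
        id int PRIMARY KEY,
        value text,
        attrs map<text, text>
    );

    -- Conditioned on the current value of a column: can hit the read failure.
    UPDATE test.t SET value = 'new' WHERE id = 1 IF value = 'old';

    -- Conditioned only on row non-existence: doesn't seem to be affected.
    INSERT INTO test.t (id, value) VALUES (2, 'x') IF NOT EXISTS;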
For applications affected by this bug, a possible workaround is to prevent nodes that are being upgraded from coordinating requests until their sstables have been upgraded.
We're able to reproduce this when upgrading from 2.1.16 to 3.0.10 with the following steps on a single-node cluster using a mostly pristine cassandra.yaml from the source distribution.
- Start Cassandra-2.1.16 on the node.
- Create a table with a collection column and insert some data into it (an illustrative CQL sequence for these steps follows the list).
- Flush the row to an sstable: nodetool flush.
- Update the row:
- Drain the node: nodetool drain.
- Stop the node, upgrade to 3.0.10, and start the node.
- Attempt to update the row again:
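The exact statements aren't reproduced here, but the sequence is roughly the following, again using the hypothetical test.t table from above rather than the literal statements we ran:

    -- On 2.1.16, seed the row via a lightweight transaction.
    INSERT INTO test.t (id, value, attrs) VALUES (1, 'a', {'k': 'v'}) IF NOT EXISTS;

    -- nodetool flush

    -- Update the row with a conditional (LWT) update.
    UPDATE test.t SET value = 'b' WHERE id = 1 IF value = 'a';

    -- nodetool drain; stop the node; upgrade to 3.0.10; start the node.

    -- Attempt to update the row again with another conditional update.
    UPDATE test.t SET value = 'c' WHERE id = 1 IF value = 'b';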
If the error is reproduced, cqlsh returns the following output:
and the following stack trace will be present in the system log:
Under both 3.0.8 and 3.0.9, the nodetool flush and the additional UPDATE statement before upgrading to 3.0 are not necessary to reproduce this. In that case (when Cassandra only has to read the data from one sstable?), a different stack trace appears in the log. Here's a sample from 3.0.8:
It's not clear to us what changed in 3.0.10 to make this behavior somewhat more difficult to reproduce.
We spent some time trying to track down the cause in 3.0.8, and we've identified a very small patch (which I will attach to this issue) that seems to fix it. The problem appears to be that the logic that reads data from legacy sstables can pull in range tombstones covering collection columns that weren't requested, which then breaks downstream logic that doesn't expect those tombstones to be present. The patch attempts to include those tombstones only when they're explicitly requested. However, there's enough going on in that logic that it's not clear to us whether the change is safe, so it definitely needs review from someone knowledgeable about what that area of the code is intended to do.