
CASSANDRA-14568: Static collection deletions are corrupted in 3.0 -> 2.{1,2} messages


Details

    • Bug Category: Correctness - Recoverable Corruption / Loss
    • Severity: Critical
    • Complexity: Challenging
    • Discovered By: Code Inspection

    Description

      In 2.1 and 2.2, row and complex deletions were represented as range tombstones.  LegacyLayout is our compatibility layer that translates the relevant RT patterns in 2.1/2.2 to row/complex deletions in 3.0, and vice versa.  Unfortunately, it does not handle the special case of static row deletions: they are treated as regular row deletions.  Since static rows are themselves never directly deleted, the only issue is with collection deletions.

      Collection deletions in 2.1/2.2 were encoded as a range tombstone whose bound consists of the clustering values of the affected row, followed by the bytes of the collection column’s name.  STATIC_CLUSTERING contains zero clustering values, so by treating the deletion as one for a regular row, no clustering values are written before the name of the erased collection, and the column name ends up at position zero of the bound.
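
      A minimal sketch of that bound's shape, assuming a deliberately simplified model of the legacy encoding (the class and the names "ck0" and "mycoll" are illustrative; this is not the actual LegacyLayout code):

        import java.nio.ByteBuffer;
        import java.nio.charset.StandardCharsets;
        import java.util.ArrayList;
        import java.util.List;

        public class LegacyCollectionBoundSketch
        {
            // A legacy (2.1/2.2) RT bound for a collection deletion: one
            // component per clustering value, then the collection column's name.
            static List<ByteBuffer> collectionDeletionBound(List<ByteBuffer> clusterings, String columnName)
            {
                List<ByteBuffer> components = new ArrayList<>(clusterings);
                components.add(ByteBuffer.wrap(columnName.getBytes(StandardCharsets.UTF_8)));
                return components;
            }

            public static void main(String[] args)
            {
                // Regular row: one clustering value precedes the column name.
                List<ByteBuffer> regular = collectionDeletionBound(
                        List.of(ByteBuffer.wrap("ck0".getBytes(StandardCharsets.UTF_8))), "mycoll");

                // Static row mistreated as a regular row: STATIC_CLUSTERING
                // contributes zero values, so the column name lands at position
                // zero, where the peer expects the first clustering value.
                List<ByteBuffer> broken = collectionDeletionBound(List.of(), "mycoll");

                System.out.println("regular bound components = " + regular.size()); // 2
                System.out.println("static  bound components = " + broken.size());  // 1
            }
        }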

      This can manifest in at least two ways:

      1. If the type of your first clustering key is a variable-width type, new deletions will begin appearing that cover the clustering value spelled by the column name's bytes (see the first sketch after this list).
        • If you have multiple clustering keys, you will receive an RT covering all rows with a matching first clustering key.
        • This RT will be valid as far as the system is concerned, and will go undetected unless outside data-quality checks are in place.
      2. Otherwise, data of an invalid size will be written as the clustering value and sent over the network to the 2.1 node (see the second sketch after this list).
        • The 2.1/2.2 node will handle this just fine, even though the record is junk.  Since it is a deletion covering impossible data, there will be no user-API-visible effect.  But if received as a write from a 3.0 node, it will dutifully persist the junk record.
        • The 3.0 node that originally sent this junk may later coordinate a read of the partition, notice a digest mismatch, read-repair, and serialize the junk to disk.
        • The sstable containing this record is now corrupt; the deserialization expects fixed-width data, but it encounters too many (or too few) bytes, and is left at an incorrect position to read its structural information.
        • (Alternatively, when the 2.1 node is upgraded, this will occur on eventual compaction.)
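
      First, a minimal sketch of manifestation 1, assuming a text first clustering column (the column name "mycoll" is illustrative): any byte sequence is a valid text value, so the misplaced column name decodes as a perfectly plausible clustering value and nothing flags the mistake.

        import java.nio.ByteBuffer;
        import java.nio.charset.StandardCharsets;

        public class VariableWidthMisreadSketch
        {
            // Decode a bound component the way a text clustering value would be
            // decoded: any byte sequence is acceptable, so nothing looks wrong.
            static String decodeTextClustering(ByteBuffer component)
            {
                return StandardCharsets.UTF_8.decode(component.duplicate()).toString();
            }

            public static void main(String[] args)
            {
                // Component 0 of the junk bound: the collection column's name.
                ByteBuffer componentZero = ByteBuffer.wrap("mycoll".getBytes(StandardCharsets.UTF_8));

                // The receiving node believes this is the first clustering value, so
                // the tombstone now covers rows whose first clustering key is 'mycoll'.
                System.out.println("RT covers first clustering key = '"
                                   + decodeTextClustering(componentZero) + "'");
            }
        }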
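
      Second, a sketch of the fixed-width failure mode in manifestation 2, under an invented toy serialization (a 4-byte int clustering value followed by a 4-byte structural marker; not Cassandra's actual on-disk format): the reader consumes the wrong number of bytes and is left misaligned for everything that follows.

        import java.nio.ByteBuffer;
        import java.nio.charset.StandardCharsets;

        public class FixedWidthMisreadSketch
        {
            public static void main(String[] args)
            {
                // The junk bound: the 6-byte collection column name sitting where
                // a 4-byte int clustering value should be.
                byte[] junk = "mycoll".getBytes(StandardCharsets.UTF_8);

                ByteBuffer in = ByteBuffer.allocate(junk.length + 4);
                in.put(junk);
                in.putInt(0xCAFEBABE); // structural data following the "clustering"
                in.flip();

                // The deserializer assumes the clustering is exactly 4 bytes wide,
                // so it consumes only "myco"...
                System.out.printf("misread clustering = 0x%08X%n", in.getInt());

                // ...and is now 2 bytes short of the real structural marker,
                // reading it from an incorrect position.
                System.out.printf("misread marker     = 0x%08X%n", in.getInt());
            }
        }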


          People

            Assignee: Benedict Elliott Smith
            Reporter: Benedict Elliott Smith
            Authors: Benedict Elliott Smith
            Reviewers: Aleksey Yeschenko, Sylvain Lebresne
            Votes: 0
            Watchers: 8
