  1. Apache Cassandra
  2. CASSANDRA-14568

Static collection deletions are corrupted in 3.0 -> 2.{1,2} messages

Details

    • Correctness - Recoverable Corruption / Loss
    • Critical
    • Challenging
    • Code Inspection

    Description

      In 2.1 and 2.2, row and complex deletions were represented as range tombstones (RTs). LegacyLayout is our compatibility layer that translates the relevant RT patterns in 2.1/2.2 to row/complex deletions in 3.0, and vice versa. Unfortunately, it does not handle the special case of static row deletions; they are treated as regular row deletions. Since static rows are themselves never directly deleted, the only issue is with collection deletions.

      Collection deletions in 2.1/2.2 were encoded as a range tombstone consisting of the sequence of clustering key values for the affected row, followed by the bytes representing the name of the collection column. STATIC_CLUSTERING contains zero clusterings, so when the deletion is treated as one for a regular row, no clustering values are written ahead of the name of the deleted collection. The column name therefore lands at position zero, where the first clustering key is expected.
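
The position shift can be illustrated with a toy model. This is only a sketch, not the actual LegacyLayout code: the component framing (2-byte length, value, one trailing byte) is a simplifying assumption, and the function names, the clustering value `ck1` and the collection name `tags` are all hypothetical.

```python
import struct

def encode_component(value: bytes) -> bytes:
    # Assumed framing for illustration: 2-byte big-endian length, value, 1 trailing byte.
    return struct.pack(">H", len(value)) + value + b"\x00"

def legacy_collection_deletion_bound(clusterings, collection_name: bytes) -> bytes:
    # Clustering values first, then the collection column name (hypothetical helper).
    components = list(clusterings) + [collection_name]
    return b"".join(encode_component(c) for c in components)

def component_at(bound: bytes, index: int) -> bytes:
    # Walk the framed components and return the one at the given position.
    pos = 0
    for _ in range(index):
        (length,) = struct.unpack_from(">H", bound, pos)
        pos += 2 + length + 1
    (length,) = struct.unpack_from(">H", bound, pos)
    return bound[pos + 2 : pos + 2 + length]

# Regular row: one clustering value precedes the collection name.
regular = legacy_collection_deletion_bound([b"ck1"], b"tags")
# Static row treated as a regular row: zero clusterings are written.
static = legacy_collection_deletion_bound([], b"tags")

print(component_at(regular, 0))  # the clustering value
print(component_at(static, 0))   # the collection name, where a clustering is expected
```

In this model, a receiver that reads position zero as the first clustering key gets the column name bytes instead.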

      This can exhibit itself in at least two ways:

      1. If the first clustering key has a variable-width type, new deletions will begin to appear that cover the clustering key whose value equals the bytes of the column name.
        • If you have multiple clustering keys, you will receive an RT covering all rows with a matching first clustering key.
        • This RT will be valid as far as the system is concerned, and go undetected unless there are outside data quality checks in place.
      2. Otherwise, data of an invalid size will be written to the clustering and sent over the network to the 2.1/2.2 node.
        • The 2.1/2.2 node will handle this just fine, even though the record is junk. Since it is a deletion covering impossible data, there will be no effect visible through the user API. But if the record is received as a write from a 3.0 node, the 2.1/2.2 node will dutifully persist it.
        • The 3.0 node that originally sent this junk may later coordinate a read of the partition, notice a digest mismatch, perform a read repair and serialize the junk to disk.
        • The sstable containing this record is now corrupt: deserialization expects fixed-width data but encounters too many (or too few) bytes, and is then at an incorrect position to read its structural information.
        • (Alternatively, when the 2.1/2.2 node is upgraded, this will occur on eventual compaction.)
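
The misaligned read in case 2 can be sketched with a toy fixed-width record. This is not the real sstable format: the single marker byte stands in for whatever structural information follows a clustering value, and the names `MARKER`, `write_record` and `read_record` are hypothetical.

```python
import io

MARKER = b"\x7f"  # hypothetical structural byte that follows each clustering value

def write_record(value: bytes, payload: bytes) -> bytes:
    # The value is written with no length prefix, as a fixed-width type allows.
    return value + MARKER + payload

def read_record(data: bytes, width: int):
    # The reader trusts the declared width, then expects the marker.
    stream = io.BytesIO(data)
    value = stream.read(width)
    marker_ok = stream.read(1) == MARKER
    return value, marker_ok

# A well-formed 8-byte clustering value (e.g. a 64-bit integer).
good = write_record((42).to_bytes(8, "big"), b"rest-of-row")
# The 4-byte column name written where the 8-byte value was expected.
junk = write_record(b"tags", b"rest-of-row")

print(read_record(good, 8))  # marker found where expected
print(read_record(junk, 8))  # reader overruns into the payload; marker check fails
```

With the junk record, the reader consumes the marker and part of the payload as if they were the value, so everything it reads afterwards is misaligned.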

          People

            benedict Benedict Elliott Smith
            Benedict Elliott Smith
            Aleksey Yeschenko, Sylvain Lebresne