  1. Cassandra
  2. CASSANDRA-17251

USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results

Details

    • Bug
    • Status: Patch Available
    • Normal
    • Resolution: Unresolved
    • 3.0.x, 3.11.x, 4.0.x
    • Local/Other
    • None
    • Correctness - Consistency
    • Normal
    • Normal
    • User Report
    • All
    • None
      Added ConflictsTest

    Description

      Combining USING timestamp = timestamp with ttl = ttl can produce non-deterministic MergeIterator results, causing DigestMismatchExceptions and increased latencies. The increased latencies come from the additional round trips triggered by the digest mismatch and from read repair rewriting the data. The additional writes increase the number of sstables the key is stored in, all of which must be scanned on read.

      The order of events is:
      1. For a given partition, a write is performed with USING timestamp = sometime and ttl = ttl1.
      2. Cassandra records this write with timestamp = sometime, ttl = ttl1, expires_at = now + ttl1.
      3. After N seconds, another write to the same partition is performed with USING timestamp = sometime and ttl = ttl2, where ttl2 = ttl1 - N. This write only makes it to a subset of replicas* for some reason (e.g. partial write, node down, etc).
      4. Cassandra records this write with timestamp = sometime, ttl = ttl2, expires_at = now + ttl2. It's important to note that at this point the expires_at from step 2 is equal to the expires_at here, because expires_at is calculated relative to the current write time rather than the provided timestamp, and the ttl has been reduced by exactly the time that has passed. This write also makes it to a subset of replicas*.
      5. A read of the data is performed.
      5a. The MergeIterator resolves conflicts locally (across sstables) using Conflicts.resolveRegular or Cells.resolveRegular. The resolution takes into account the write timestamp, the liveness of the cell, the values themselves, and how much time is left to live via the expires_at field. In this scenario all of these fields are equal, so Cassandra picks the sstable "on the right", which is non-deterministic. The only item that differs is the ttl itself.
      5b. One node returns the non-deterministically chosen value for the row, the other two calculate and send a digest to the coordinator. The digest includes the relative ttl field which may not match. This results in a DigestMismatchException at the coordinator.
      6. Read repair is triggered
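
      The sequence above can be sketched with a small Python model. This is only an illustration under stated assumptions, not Cassandra's code: the Cell class, the write and resolve helpers, the order of comparisons inside resolve, and the concrete numbers are all hypothetical. It shows that the two writes end up with the same expires_at but different ttl values, so a resolver that ignores ttl falls through to "whichever cell is on the right".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: bytes
    timestamp: int   # client-supplied write timestamp (USING timestamp)
    ttl: int         # client-supplied ttl, in seconds
    expires_at: int  # derived on the server: local write time + ttl


def write(value: bytes, timestamp: int, ttl: int, now: int) -> Cell:
    # expires_at is relative to the local clock, not to the supplied timestamp.
    return Cell(value, timestamp, ttl, now + ttl)


def resolve(left: Cell, right: Cell) -> Cell:
    # Hypothetical simplification: compare timestamp, then expires_at, then
    # value. Liveness is omitted because both cells are live and expiring.
    # ttl is deliberately not compared.
    for key in (lambda c: c.timestamp, lambda c: c.expires_at, lambda c: c.value):
        if key(left) != key(right):
            return left if key(left) > key(right) else right
    return right  # complete tie: the winner depends only on argument order


t0, n, ttl1 = 1_000, 60, 3_600
first = write(b"v", timestamp=42, ttl=ttl1, now=t0)           # steps 1-2
second = write(b"v", timestamp=42, ttl=ttl1 - n, now=t0 + n)  # steps 3-4

assert first.expires_at == second.expires_at  # (t0 + ttl1) == (t0 + n) + (ttl1 - n)
print(resolve(first, second).ttl, resolve(second, first).ttl)  # 3540 3600
```

      Swapping the arguments changes which ttl survives the merge, even though every field the resolver inspects is identical.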

      *NOTE: it is not strictly necessary for the write to make it to only a subset of replicas. The sstables returned from the live set can also come back in a different order for reasons such as compaction or repair, which can lead to the same behavior. From what we can tell, this also affects repair.
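
      To make the ordering effect concrete, here is a hedged Python sketch in which a single replica holds the same cells in several sstables and merges them in every possible order. The tuple layout, the resolve and digest helpers, and the use of SHA-256 are assumptions for illustration only; Cassandra's actual digest computation is not reproduced here. The point is that a digest which includes ttl takes more than one value depending purely on merge order.

```python
import hashlib
from functools import reduce
from itertools import permutations

# Hypothetical cell layout: (timestamp, expires_at, ttl, value)


def resolve(left, right):
    # Compare everything except ttl (index 2); on a complete tie keep the right cell.
    left_key = (left[0], left[1], left[3])
    right_key = (right[0], right[1], right[3])
    if left_key != right_key:
        return left if left_key > right_key else right
    return right


def digest(cell):
    # Stand-in digest that, like the one described in step 5b, includes the ttl.
    timestamp, expires_at, ttl, value = cell
    payload = f"{timestamp}:{expires_at}:{ttl}:".encode() + value
    return hashlib.sha256(payload).hexdigest()


# Three sstables holding the same cell; one was written with the shorter ttl.
sstables = [
    (42, 4600, 3600, b"v"),
    (42, 4600, 3540, b"v"),
    (42, 4600, 3600, b"v"),
]

digests = {digest(reduce(resolve, order)) for order in permutations(sstables)}
print(len(digests))  # 2
```

      With a deterministic tie-break (for example, one that also compared ttl) the set would contain a single digest regardless of order.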

          People

            jwest Jordan West