Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-7346

Modify reconcile logic to always pick a tombstone over a counter cell

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.1 rc2
    • Component/s: None
    • Labels:
      None

      Description

      For counters, timestamps are automatically computed to be milliseconds since the epoch. For everything else, when not specified manually, they are microseconds since the epoch. This means if you delete a counter row, subsequent updates are lost unexpectedly.

      I know that deleting counters is not recommended, but that's only because deletes and incs don't commute. If you know you have stopped incs, then delete, then start again (with some external synchronization) then deleting is fine.

        Issue Links

          Activity

          Hide
          slebresne Sylvain Lebresne added a comment -

          If you know you have stopped incs, then delete, then start again (with some external synchronization) then deleting is fine.

          Actually, fyi, I don't think that's true because there is no guarantee that subsequent update will end up being ultimately merged (by compaction) in the order they were received by the replica. You could even theoretically have the case where it looks like everything worked properly (i.e. you get only post-delete stuff accounted for) for a while, and then a compaction ends up compacting the pre and post increment but not the delete and pouf you get a corrupted value.

          Which btw is not to say that using milliseconds for counter timestamps was on purpose, it's definitively an oversight that has persisted somehow. But we had discussed the idea of making the current behaviour more "official" by making counter deletes use Integer.MAX_VALUE. The rational being, if you try to increment-delete-increment, if the post-increment don't work, you might be surprised initially, but at least it makes it clear that something doesn't work. If doing an increment-delete-increment kind-of works in simple testing, it's easy to miss the fact that counter deletes are broken and to get bitten later on. Given that even with external synchronization you can't really guarantee that increment post-delete will work as said above, I'm tempted to go with the more clearly-broken-so-you-don-t-get-bitten-badly-later behaviour.

          Show
          slebresne Sylvain Lebresne added a comment - If you know you have stopped incs, then delete, then start again (with some external synchronization) then deleting is fine. Actually, fyi, I don't think that's true because there is no guarantee that subsequent update will end up being ultimately merged (by compaction) in the order they were received by the replica. You could even theoretically have the case where it looks like everything worked properly (i.e. you get only post-delete stuff accounted for) for a while, and then a compaction ends up compacting the pre and post increment but not the delete and pouf you get a corrupted value. Which btw is not to say that using milliseconds for counter timestamps was on purpose, it's definitively an oversight that has persisted somehow. But we had discussed the idea of making the current behaviour more "official" by making counter deletes use Integer.MAX_VALUE. The rational being, if you try to increment-delete-increment, if the post-increment don't work, you might be surprised initially, but at least it makes it clear that something doesn't work. If doing an increment-delete-increment kind-of works in simple testing, it's easy to miss the fact that counter deletes are broken and to get bitten later on. Given that even with external synchronization you can't really guarantee that increment post-delete will work as said above, I'm tempted to go with the more clearly-broken-so-you-don-t-get-bitten-badly-later behaviour.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          For everything else, when not specified manually, they are microseconds since the epoch.

          FYI, other than counters, there are also hints, that also use millis for timestamps. Both are known, 'official' oversights.

          Show
          iamaleksey Aleksey Yeschenko added a comment - For everything else, when not specified manually, they are microseconds since the epoch. FYI, other than counters, there are also hints, that also use millis for timestamps. Both are known, 'official' oversights.
          Hide
          rlow Richard Low added a comment -

          Good points, I now know that deletion is always unsafe

          But I like the idea of making it clearer that it doesn't work. I think there might be a few different kinds of tombstones that could cause this - probably anything that looks like a cql row delete.

          Show
          rlow Richard Low added a comment - Good points, I now know that deletion is always unsafe But I like the idea of making it clearer that it doesn't work. I think there might be a few different kinds of tombstones that could cause this - probably anything that looks like a cql row delete.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Richard Low FWIW, that was the approach taken by CASSANDRA-6506 (replacing counter tombstones' timestamps with Integer.MAX_VALUE, explicitly). This is also was we are most likely to end up with the second iteration of 6506, anyway (and replace the logical clock with microseconds since epoch, w/ some corrections if necessary).

          So you could say that this issue is ultimately a duplicate of CASSANDRA-6506. That said, I'm closing it with a 'Not a Problem' for now.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Richard Low FWIW, that was the approach taken by CASSANDRA-6506 (replacing counter tombstones' timestamps with Integer.MAX_VALUE, explicitly). This is also was we are most likely to end up with the second iteration of 6506, anyway (and replace the logical clock with microseconds since epoch, w/ some corrections if necessary). So you could say that this issue is ultimately a duplicate of CASSANDRA-6506 . That said, I'm closing it with a 'Not a Problem' for now.
          Hide
          slebresne Sylvain Lebresne added a comment -

          Don't we at least want to cleanup the current code by making counter tombstones use MAX_VALUE instead of relying on the current misuse millis-vs-micros? At least in 2.1?

          Show
          slebresne Sylvain Lebresne added a comment - Don't we at least want to cleanup the current code by making counter tombstones use MAX_VALUE instead of relying on the current misuse millis-vs-micros? At least in 2.1?
          Hide
          rlow Richard Low added a comment -

          +1 on Sylvain's suggestion.

          Show
          rlow Richard Low added a comment - +1 on Sylvain's suggestion.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          All right. Pushed to https://github.com/iamaleksey/cassandra/commits/7346 (the last and only commit there).

          Initially was going to convert the timestamps in CounterMutation itself, however, it turned out to be easier to modify CQL3/CQL2/Thrift (it's all isolated to UpdateParameters for CQL3 and two methods in CassandraServer for Thrift).

          Updated CQL2 too, even though it's gone from 3.0.

          Show
          iamaleksey Aleksey Yeschenko added a comment - All right. Pushed to https://github.com/iamaleksey/cassandra/commits/7346 (the last and only commit there). Initially was going to convert the timestamps in CounterMutation itself, however, it turned out to be easier to modify CQL3/CQL2/Thrift (it's all isolated to UpdateParameters for CQL3 and two methods in CassandraServer for Thrift). Updated CQL2 too, even though it's gone from 3.0.
          Hide
          slebresne Sylvain Lebresne added a comment -

          +1

          Show
          slebresne Sylvain Lebresne added a comment - +1
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Committed, thanks.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Committed, thanks.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Reverted in c94caec1b5266af4f20290b9d3d82faff4977aa7 (https://github.com/apache/cassandra/commit/c94caec1b5266af4f20290b9d3d82faff4977aa7) because it prevents us from improving counter reads in the future (CollationController#collectTimeOrderedData()) - when deletes are involved.

          What we are going to do instead is to special-case cell reconcile logic to always choose tombstones over a counter cell, which is more explicit and, ultimately, what we want to do.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Reverted in c94caec1b5266af4f20290b9d3d82faff4977aa7 ( https://github.com/apache/cassandra/commit/c94caec1b5266af4f20290b9d3d82faff4977aa7 ) because it prevents us from improving counter reads in the future (CollationController#collectTimeOrderedData()) - when deletes are involved. What we are going to do instead is to special-case cell reconcile logic to always choose tombstones over a counter cell, which is more explicit and, ultimately, what we want to do.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Also, as part of the ticket, fix the millisecond timestamps of the counter cells to use microseconds, for consistency with every other code path in Cassandra.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Also, as part of the ticket, fix the millisecond timestamps of the counter cells to use microseconds, for consistency with every other code path in Cassandra.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Pushed -f to the same branch (https://github.com/iamaleksey/cassandra/commits/7346)

          And made addCounter() to finally start using micros.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Pushed -f to the same branch ( https://github.com/iamaleksey/cassandra/commits/7346 ) And made addCounter() to finally start using micros.
          Hide
          slebresne Sylvain Lebresne added a comment -

          Lgtm, +1.

          One small nit: I'd put some isLive() in DeletionTime and replace the topLevel.markedForDeleteAt > Long.MIN_VALUE by !topLevel.isLive(). Feels more explicit/readable.

          Show
          slebresne Sylvain Lebresne added a comment - Lgtm, +1. One small nit: I'd put some isLive() in DeletionTime and replace the topLevel.markedForDeleteAt > Long.MIN_VALUE by !topLevel.isLive() . Feels more explicit/readable.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Committed with the nit addressed, thanks.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Committed with the nit addressed, thanks.

            People

            • Assignee:
              iamaleksey Aleksey Yeschenko
              Reporter:
              rlow Richard Low
              Reviewer:
              Sylvain Lebresne
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development