  Cassandra / CASSANDRA-11991

On clock skew, paxos may "corrupt" the node clock

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.1.15, 2.2.7, 3.0.8, 3.8
    • Component/s: None
    • Labels:
      None

      Description

      We made a mistake in CASSANDRA-9649 so that a temporary clock skew on one node can "corrupt" other node clocks through Paxos. That wasn't intended and we should fix it. I'll attach a patch later.

        Activity

        kohlisankalp sankalp kohli added a comment -

        I can review the patch. We should try to get it in for 2.1.15 if possible

        slebresne Sylvain Lebresne added a comment -

        For context, the problem is basically the one I described in my comment on CASSANDRA-9649 and for which I suggested reverting CASSANDRA-7801.

        Now, I was kind of wrong about reverting CASSANDRA-7801: since CASSANDRA-9649 we have been relying on ClientState.getTimestamp() to give us timestamps that are unique for the running VM, which means we can't blindly revert CASSANDRA-7801.

        What I think is the simplest solution, however, is to stop relying on that property (of ClientState.getTimestamp()) for the uniqueness of our ballots, and instead randomize the non-timestamp parts of the ballot for every new ballot. With that, we don't have to revert CASSANDRA-7801; we just have to ensure that if we use the last known proposal timestamp (i.e. if whichever clock generated that timestamp is "in the future"), we don't persist it in the local clock. This in turn means the timestamp might not be unique in the VM for two concurrent paxos operations, hence the need to randomize the rest of the UUID.

        I've pushed a patch for this for 2.1. I'll attach branches for 2.2+ with tests tomorrow (I was waiting on the 2.1 results before doing that), but I don't think the modified code has changed since 2.1, so I'm marking this ready for review in the meantime.

        2.1 utests dtests
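
        A minimal sketch of the approach described in the comment above (illustrative only, not the committed patch; class and method names are made up): the node-local clock is only ever advanced from the local wall clock, a "future" proposal timestamp is used for the ballot without being persisted, and ballot uniqueness comes from randomizing the clock-seq-and-node half of the TimeUUID.

        import java.util.UUID;
        import java.util.concurrent.ThreadLocalRandom;
        import java.util.concurrent.atomic.AtomicLong;

        public final class BallotClockSketch
        {
            // Per-node clock in microseconds; only ever advanced from the local
            // wall clock, never from timestamps seen in other nodes' proposals.
            private static final AtomicLong lastTimestampMicros = new AtomicLong(0);

            // Returns a ballot timestamp that is at least minTimestampToUse (the
            // timestamp of the most recent proposal seen). If that value is "in
            // the future", it is used for this ballot but NOT persisted, so a
            // skewed remote clock cannot drag the local clock forward.
            static long ballotMicros(long minTimestampToUse)
            {
                while (true)
                {
                    long current = System.currentTimeMillis() * 1000;
                    long last = lastTimestampMicros.get();
                    long next = last >= current ? last + 1 : current;
                    if (lastTimestampMicros.compareAndSet(last, next))
                        // Two concurrent calls may both return minTimestampToUse,
                        // hence the randomized ballot bits below.
                        return Math.max(next, minTimestampToUse);
                }
            }

            // Random clock-seq-and-node bits (IETF variant) for the ballot UUID,
            // so ballots stay unique even when their timestamps collide.
            static long randomClockSeqAndNode()
            {
                long r = ThreadLocalRandom.current().nextLong();
                return (r & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
            }

            // Assemble a version-1 (time-based) UUID ballot from the two halves.
            static UUID ballot(long minTimestampToUse)
            {
                long micros = ballotMicros(minTimestampToUse);
                long ts = micros * 10 + 0x01B21DD213814000L; // 100ns units since the UUID epoch
                long msb = (ts << 32)                        // time_low
                         | ((ts & 0xFFFF00000000L) >>> 16)   // time_mid
                         | 0x1000L                           // version 1
                         | ((ts >>> 48) & 0x0FFFL);          // time_hi
                return new UUID(msb, randomClockSeqAndNode());
            }
        }

        Two concurrent ballots may then share a timestamp, but their randomized lower halves keep the UUIDs distinct, which is the trade-off noted above.
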
        jasobrown Jason Brown added a comment -

        lgtm. The only minor nit I have is that CASSANDRA-9649 made ClientState#lastTimestampMicros a static field. I believe this change helped accelerate the cluster getting into a bad state wrt the propagation of bad timestamps. wdyt about switching it back to being an instance field (not static)?

        slebresne Sylvain Lebresne added a comment -

        The only minor nit I have is that CASSANDRA-9649 made ClientState#lastTimestampMicros a static field. I believe this change helped accelerate the cluster getting into a bad state wrt the propagation of bad timestamps. wdyt about switching it back to being an instance field (not static)?

        There are really two reasons I made it static:

        1. CASSANDRA-7801: it's not a huge thing, but it helps users be less confused.
        2. Because I feel that having it not static was a mistake in the first place. That is, even if we completely forget about Paxos, I think having our CQL clock strictly monotonic per-node (rather than per-connection) is cleaner and less surprising to people. So I'm not a fan of going back to the older behavior, and doing so could be considered a breaking change (it's easier to add new guarantees than to take existing ones away).

        I don't disagree that this made the consequences of the bug worse, but I don't think removing the static is minor at all. And the code now feels easy enough to convince oneself that we're not modifying lastTimestampMicros in bad ways anymore, so hopefully we won't make that mistake again.

        Besides, if you're smart, you should be switching to client-generated timestamps anyway.
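
        For reference, a hedged sketch of the client-generated-timestamps approach mentioned above, assuming the 3.x DataStax Java driver (the keyspace and table names here are made up):

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;

        public class ClientTimestampExample
        {
            public static void main(String[] args)
            {
                try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                     Session session = cluster.connect())
                {
                    // Option 1: carry the timestamp in the CQL itself (in microseconds).
                    session.execute("INSERT INTO ks.t (k, v) VALUES (1, 'x') USING TIMESTAMP "
                                    + System.currentTimeMillis() * 1000);

                    // Option 2: set a per-statement default timestamp (requires protocol v3+).
                    SimpleStatement stmt = new SimpleStatement("INSERT INTO ks.t (k, v) VALUES (2, 'y')");
                    stmt.setDefaultTimestamp(System.currentTimeMillis() * 1000);
                    session.execute(stmt);
                }
            }
        }

        With client-side timestamps, the mutation timestamp comes from the client's clock rather than the coordinator's per-node CQL clock discussed above.
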

        slebresne Sylvain Lebresne added a comment -

        Also, I've merged the patch up (no conflict whatsoever) and started CI on all branches:

        version utests dtests
        2.1 utests dtests
        2.2 utests dtests
        3.0 utests dtests
        trunk utests dtests

        I also updated the comment on top of ClientState#lastTimestampMicros since it didn't fully explain the real motivation.

        jasobrown Jason Brown added a comment -

        I think having our CQL clock strictly monotonic per-node (rather than per-connection) is cleaner and less surprising to people.

        ok, I'll buy that.

        Otherwise, +1 on all branches. I checked out the test results, and the ones that failed were either 1) unrelated to this CAS change or 2) test timeouts (and still unrelated to CAS).

        kohlisankalp sankalp kohli added a comment -

        +1 Sylvain Lebresne, please commit it.

        slebresne Sylvain Lebresne added a comment -

        Committed, thanks.


          People

          • Assignee: slebresne Sylvain Lebresne
          • Reporter: slebresne Sylvain Lebresne
          • Reviewer: Jason Brown
          • Votes: 1
          • Watchers: 7
