Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Fix Version/s: 1.1.6
    • Component/s: None
    • Labels:
      None

      Description

      It seems that there are two corner cases where commitlog is not replayed after a restart :

      • After a reboot of a server + restart of cassandra (1.1.0 to 1.1.4)
      • After doing an upgrade from cassandra 1.1.X to cassandra 1.1.5

      This is due to the fact that the commitlog segment id should always be an incrementing number (see this condition : https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/commitlog/CommitLogReplayer.java#L247 )

      But this assertion can be broken :
      In the first case, it is generated by System.nanoTime() but it seems that System.nanoTime() is using the boot time as the base/reference (at least on java6 & linux), thus after a reboot, System.nanoTime() can return a lower number than before the reboot (and the javadoc says the reference is a relative point in time...)
      In the second case, this was introduced by #4601 (which changes System.nanoTime() by System.currentTimeMillis() thus people starting with 1.1.5 are safe)

      This could explain the following tickets : #4741 and #4481

      1. 4782.txt
        3 kB
        Jonathan Ellis

        Activity

        Fabien Rousseau created issue -
        Hide
        Brandon Williams added a comment -

        Thanks for the summary. What are you proposing we do?

        Show
        Brandon Williams added a comment - Thanks for the summary. What are you proposing we do?
        Hide
        Fabien Rousseau added a comment -

        To solve the first case, it's probably better, when possible, to upgrade to a newer version, then rewrite all SSTables (or at least all SSTables metadata)
        For the second case, just rewrite all SSTables (or at least all SSTables metadata)

        Each SSTable metadata contains the ReplayPosition (and the max is taken to know which commitlog to replay), thus if System.nanoTime() returned a number which is a timestamp in the future, previous commitlogs will be ignored),
        thus a drain should prevent from losing data (because there is no commitlog to replay).
        And because SSTables are immutables, then rewriting them completely seems a better option (rather than modifying previously written metadata)

        Maybe the simplest to do is to have a new option to nodetool (or a new option to nodetool upgradesstables) which only changes the metadata using the following rule :
        if the replayPosition.segment of the SSTable is in the future, then reset it to NONE otherwise, let it to its current value. (NONE is valid value if node was drained and restarted)

        By using this rule :

        • if a sstable was generated previously using a higher System.nanoTime() then it is reset
        • if a sstable was generated previously using a lower System.nanoTime() OR System.currentTimeMillis() then it is left as is
        • if a sstable is generated between the start & this rewriting process, then it is left as is

        Thus the upgrade should be :

        • drain node
        • upgrade
        • start
        • run the process described above

        What do you think ?

        Show
        Fabien Rousseau added a comment - To solve the first case, it's probably better, when possible, to upgrade to a newer version, then rewrite all SSTables (or at least all SSTables metadata) For the second case, just rewrite all SSTables (or at least all SSTables metadata) Each SSTable metadata contains the ReplayPosition (and the max is taken to know which commitlog to replay), thus if System.nanoTime() returned a number which is a timestamp in the future, previous commitlogs will be ignored), thus a drain should prevent from losing data (because there is no commitlog to replay). And because SSTables are immutables, then rewriting them completely seems a better option (rather than modifying previously written metadata) Maybe the simplest to do is to have a new option to nodetool (or a new option to nodetool upgradesstables) which only changes the metadata using the following rule : if the replayPosition.segment of the SSTable is in the future, then reset it to NONE otherwise, let it to its current value. (NONE is valid value if node was drained and restarted) By using this rule : if a sstable was generated previously using a higher System.nanoTime() then it is reset if a sstable was generated previously using a lower System.nanoTime() OR System.currentTimeMillis() then it is left as is if a sstable is generated between the start & this rewriting process, then it is left as is Thus the upgrade should be : drain node upgrade start run the process described above What do you think ?
        Jonathan Ellis made changes -
        Field Original Value New Value
        Assignee Jonathan Ellis [ jbellis ]
        Fix Version/s 1.1.6 [ 12323257 ]
        Affects Version/s 1.1.1 [ 12319857 ]
        Affects Version/s 1.1.2 [ 12321445 ]
        Affects Version/s 1.1.3 [ 12321881 ]
        Affects Version/s 1.1.4 [ 12322490 ]
        Affects Version/s 1.1.5 [ 12322941 ]
        Priority Major [ 3 ] Critical [ 2 ]
        Hide
        Jonathan Ellis added a comment -

        Thanks, Fabien. I admit that I didn't realize at first the implications of the millis fix on existing sstable metadata.

        Patch attached that bumps the sstable version to hf as a marker that we know metadata with that version has sane replay positions. Replay positions from older metadata will be treated as NONE.

        This will force a full replay the first restart on 1.1.6; afterwards, any newly flushed sstables will have the sane-replay-position marker and future restarts will not need to replay data unnecessarily.

        What do you think?

        Show
        Jonathan Ellis added a comment - Thanks, Fabien. I admit that I didn't realize at first the implications of the millis fix on existing sstable metadata. Patch attached that bumps the sstable version to hf as a marker that we know metadata with that version has sane replay positions. Replay positions from older metadata will be treated as NONE. This will force a full replay the first restart on 1.1.6; afterwards, any newly flushed sstables will have the sane-replay-position marker and future restarts will not need to replay data unnecessarily. What do you think?
        Jonathan Ellis made changes -
        Attachment 4782.txt [ 12548586 ]
        Jonathan Ellis made changes -
        Attachment 4782.txt [ 12548586 ]
        Hide
        Jonathan Ellis added a comment -

        (patch revised to update CURRENT_VERSION as well)

        Show
        Jonathan Ellis added a comment - (patch revised to update CURRENT_VERSION as well)
        Jonathan Ellis made changes -
        Attachment 4782.txt [ 12548588 ]
        Hide
        Fabien Rousseau added a comment -

        Thanks for the patch. This solution is more simple and elegant than the one I proposed.

        I tested it and it worked like a charm.
        Nevertheless, if there are counters CF, a drain is probably necessary to avoid replaying the full commitlog and avoid having overcounts. (I don't think it is a problem, just something to know before the upgrade...)

        Show
        Fabien Rousseau added a comment - Thanks for the patch. This solution is more simple and elegant than the one I proposed. I tested it and it worked like a charm. Nevertheless, if there are counters CF, a drain is probably necessary to avoid replaying the full commitlog and avoid having overcounts. (I don't think it is a problem, just something to know before the upgrade...)
        Hide
        Fabien Rousseau added a comment -

        By the way, I just noticed that the commitlog files were not replayed in the order of their ids.
        It seems that they are sorted by "last modification date" before being replayed, but this does not corresponds to their ids.
        Moreover, "last modification date" is changed when a file is copied, so, this could also change the order of archived commitlogs.

        I suppose the sort order of commit log files is for schemas ?

        Maybe it's safer to sort them using the id in the file name ?

        Show
        Fabien Rousseau added a comment - By the way, I just noticed that the commitlog files were not replayed in the order of their ids. It seems that they are sorted by "last modification date" before being replayed, but this does not corresponds to their ids. Moreover, "last modification date" is changed when a file is copied, so, this could also change the order of archived commitlogs. I suppose the sort order of commit log files is for schemas ? Maybe it's safer to sort them using the id in the file name ?
        Hide
        Jonathan Ellis added a comment -

        Committed with a warning in NEWS to drain.

        Go ahead and open a separate ticket to fix sort order.

        Show
        Jonathan Ellis added a comment - Committed with a warning in NEWS to drain. Go ahead and open a separate ticket to fix sort order.
        Jonathan Ellis made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Reviewer frousseau
        Resolution Fixed [ 1 ]
        Hide
        Robert Coli added a comment -

        If drain is required between versions to avoid this issue then CASSANDRA-4446, where drain sometimes doesn't actually drain, seems to have become more significant.

        Show
        Robert Coli added a comment - If drain is required between versions to avoid this issue then CASSANDRA-4446 , where drain sometimes doesn't actually drain, seems to have become more significant.
        Hide
        Omid Aladini added a comment -

        CASSANDRA-4446 does not happen to me any more (so far) when restarting 1.1.6 into 1.1.6. I could observe CASSANDRA-4446 on upgrade from 1.1.3 to 1.1.6 though.

        Show
        Omid Aladini added a comment - CASSANDRA-4446 does not happen to me any more (so far) when restarting 1.1.6 into 1.1.6. I could observe CASSANDRA-4446 on upgrade from 1.1.3 to 1.1.6 though.
        Hide
        Arya Goudarzi added a comment -

        +1. I think the issue I had with data loss after node restart, actually had to do with this bug. Thanks for the fix.

        Show
        Arya Goudarzi added a comment - +1. I think the issue I had with data loss after node restart, actually had to do with this bug. Thanks for the fix.
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12728936 ] patch-available, re-open possible [ 12753439 ]
        Gavin made changes -
        Workflow patch-available, re-open possible [ 12753439 ] reopen-resolved, no closed status, patch-avail, testing [ 12758732 ]

          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Fabien Rousseau
            Reviewer:
            Fabien Rousseau
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development