Cassandra
  1. Cassandra
  2. CASSANDRA-2419

Risk of counter over-count when recovering commit log

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 0.8.0
    • Component/s: Core
    • Labels:

      Description

      When a memtable was flush, there is a small delay before the commit log replay position gets updated. If the node fails during this delay, all the updates of this memtable will be replay during commit log recovery and will end-up being over-counts.

      1. 0001-Record-and-use-sstable-replay-position.patch
        32 kB
        Sylvain Lebresne
      2. 0001-Record-CL-replay-infos-alongside-sstables-v2.patch
        63 kB
        Sylvain Lebresne
      3. 2419-v3.txt
        61 kB
        Jonathan Ellis
      4. 2419-v4.txt
        62 kB
        Jonathan Ellis
      5. 2419-v6.txt
        62 kB
        Jonathan Ellis

        Activity

        Hide
        Hudson added a comment -

        Integrated in Cassandra-0.8 #93 (See https://builds.apache.org/hudson/job/Cassandra-0.8/93/)

        Show
        Hudson added a comment - Integrated in Cassandra-0.8 #93 (See https://builds.apache.org/hudson/job/Cassandra-0.8/93/ )
        Hide
        Stu Hood added a comment -

        Thanks a ton for this work! Transactionality here we come.

        Show
        Stu Hood added a comment - Thanks a ton for this work! Transactionality here we come.
        Hide
        Jonathan Ellis added a comment -

        committed

        Show
        Jonathan Ellis added a comment - committed
        Hide
        Jonathan Ellis added a comment -

        in CommitLog.java:recover(), was there a reason to create a new List<ReplayPosition> positions instead of using cfPositions.values()

        Nope. I'll fix that before commit.

        Asked Paul Cannon to also review first though, since obviously we want to be extra sure not to cause regressions here.

        Show
        Jonathan Ellis added a comment - in CommitLog.java:recover(), was there a reason to create a new List<ReplayPosition> positions instead of using cfPositions.values() Nope. I'll fix that before commit. Asked Paul Cannon to also review first though, since obviously we want to be extra sure not to cause regressions here.
        Hide
        Sylvain Lebresne added a comment -

        Minor nitpick: in CommitLog.java:recover(), was there a reason to create a new List<ReplayPosition> positions instead of using cfPositions.values() ?

        but other than that, +1 on v6.

        Show
        Sylvain Lebresne added a comment - Minor nitpick: in CommitLog.java:recover(), was there a reason to create a new List<ReplayPosition> positions instead of using cfPositions.values() ? but other than that, +1 on v6.
        Hide
        Jonathan Ellis added a comment -

        v6 fixes a test failure in v5

        Show
        Jonathan Ellis added a comment - v6 fixes a test failure in v5
        Hide
        Jonathan Ellis added a comment -

        v5 updates CL replay to use RP.getReplayPosition as well

        Show
        Jonathan Ellis added a comment - v5 updates CL replay to use RP.getReplayPosition as well
        Hide
        Sylvain Lebresne added a comment -

        v4 looks good, I like those changes (note: I committed r1099037 to fix CFSTest, it had a test with hardcoded sstable filenames using version 'f' and thus was failing)

        Show
        Sylvain Lebresne added a comment - v4 looks good, I like those changes (note: I committed r1099037 to fix CFSTest, it had a test with hardcoded sstable filenames using version 'f' and thus was failing)
        Hide
        Jonathan Ellis added a comment -

        I tried two ways of storing metadata as yaml – first with the metadata as java beans that were stored directly as yaml, and second half-manually serializing to a yaml Map<String, Object> – and both feel clunkier than just using the version field to deal with adding things. (Especially when you need to do version checks anyway when modifying things rather than just adding new fields.)

        So, v4 is substantially the same as v3 but Descriptor version is bumped to g and we use that instead of EOF to when reading RP. (Also, writeStatistics is renamed to writeMetadata.)

        Show
        Jonathan Ellis added a comment - I tried two ways of storing metadata as yaml – first with the metadata as java beans that were stored directly as yaml, and second half-manually serializing to a yaml Map<String, Object> – and both feel clunkier than just using the version field to deal with adding things. (Especially when you need to do version checks anyway when modifying things rather than just adding new fields.) So, v4 is substantially the same as v3 but Descriptor version is bumped to g and we use that instead of EOF to when reading RP. (Also, writeStatistics is renamed to writeMetadata.)
        Hide
        Jonathan Ellis added a comment -

        v3 attached with some changes:

        • SSTableMetadata removed; replayposition becomes a field in SSTable that is serialized w/ statistics
        • RP is final and part of the SSTable constructor
        • RP implements Comparable instead of a one-off resolve API; RP.getReplayPosition encapsulates the find-replay-point logic
        • RP moved to a top-level class and replaces CLContext

        I'd like to make the metadata a json blob so we can extend it more easily, so I probably need to re-introduce SSTM. Consider v3 a work in progress.

        Show
        Jonathan Ellis added a comment - v3 attached with some changes: SSTableMetadata removed; replayposition becomes a field in SSTable that is serialized w/ statistics RP is final and part of the SSTable constructor RP implements Comparable instead of a one-off resolve API; RP.getReplayPosition encapsulates the find-replay-point logic RP moved to a top-level class and replaces CLContext I'd like to make the metadata a json blob so we can extend it more easily, so I probably need to re-introduce SSTM. Consider v3 a work in progress.
        Hide
        Sylvain Lebresne added a comment -

        v2 removes commit log header completely in favor of sstable metadata about where to replay (patch against 0.8).

        This differs from v1 in that instead of keeping every (segment, replay_position) pair, we keep for a given sstable, only the position for the most recent segment (that is, we leverage the fact that we use increasing timestamps for commit logs).

        The reason for this is twofold:

        1. this more compact (and simple)
        2. if we remove the commit log header, we need to be able to say if a given segment is dirty or not for a given column family. That is, we don't want to know if some replay position existed on this segment, but if a relevant one still exist. So for a given column family we really only care about the newest (segment, replay_position) pair.

        Now there is the question of the update path. With this patch, the (existing) commit log headers will be ignored. This means that ideally before updating to a version having this patch people would use drain. If they do not, then the commit logs will be fully replayed. Pre-0.8, it's not a big deal. With counters, this could mean over-counts (that's exactly what this ticket is about). So I would be in favor of putting this for 0.8.0, since it is a bug fix and it will avoids the problem of upgrading from a version already having counters. But I would admit this is not trivial patch, so ...

        Show
        Sylvain Lebresne added a comment - v2 removes commit log header completely in favor of sstable metadata about where to replay (patch against 0.8). This differs from v1 in that instead of keeping every (segment, replay_position) pair, we keep for a given sstable, only the position for the most recent segment (that is, we leverage the fact that we use increasing timestamps for commit logs). The reason for this is twofold: this more compact (and simple) if we remove the commit log header, we need to be able to say if a given segment is dirty or not for a given column family. That is, we don't want to know if some replay position existed on this segment, but if a relevant one still exist. So for a given column family we really only care about the newest (segment, replay_position) pair. Now there is the question of the update path. With this patch, the (existing) commit log headers will be ignored. This means that ideally before updating to a version having this patch people would use drain. If they do not, then the commit logs will be fully replayed. Pre-0.8, it's not a big deal. With counters, this could mean over-counts (that's exactly what this ticket is about). So I would be in favor of putting this for 0.8.0, since it is a bug fix and it will avoids the problem of upgrading from a version already having counters. But I would admit this is not trivial patch, so ...
        Hide
        Jonathan Ellis added a comment -

        I'm wondering, couldn't we just drop the commit log header if we do that.

        Were you planning to update w/ that change?

        Show
        Jonathan Ellis added a comment - I'm wondering, couldn't we just drop the commit log header if we do that. Were you planning to update w/ that change?
        Hide
        Jonathan Ellis added a comment -

        Yes, I think we can.

        Show
        Jonathan Ellis added a comment - Yes, I think we can.
        Hide
        Sylvain Lebresne added a comment -

        I'm wondering, couldn't we just drop the commit log header if we do that.

        Show
        Sylvain Lebresne added a comment - I'm wondering, couldn't we just drop the commit log header if we do that.
        Hide
        Sylvain Lebresne added a comment -

        Attaching patch implementing Jonathan's idea to record the CL replay position along with the sstable.

        Show
        Sylvain Lebresne added a comment - Attaching patch implementing Jonathan's idea to record the CL replay position along with the sstable.
        Hide
        Sylvain Lebresne added a comment -

        Right, that was just me not getting you idea at first. Make sense, sorry.

        Show
        Sylvain Lebresne added a comment - Right, that was just me not getting you idea at first. Make sense, sorry.
        Hide
        Jonathan Ellis added a comment - - edited

        I don't understand the problem. Say we have this situation:

        CommitLog-1302036825548.log: [full of writes to CF Foo counters, up to position 100. header reads dirty-at 50, our last flush position]
        Foo-g-45-Metadata.db: [flushed at position (1302036825548, 50)]
        Foo-g-46-Metadata.db: [flushed at position (1302036825548, 100)]

        We compact and get
        Foo-g-47-Metadata.db: [flushed at position (1302036825548, 100)]

        If we die and restart here we will correctly start reply of Foo at position 100 in this segment.

        (we can combine to a single flushed-at entry in this case since they were from the same CL segment. if they were from different segments we would keep both.)

        Show
        Jonathan Ellis added a comment - - edited I don't understand the problem. Say we have this situation: CommitLog-1302036825548.log: [full of writes to CF Foo counters, up to position 100. header reads dirty-at 50, our last flush position] Foo-g-45-Metadata.db: [flushed at position (1302036825548, 50)] Foo-g-46-Metadata.db: [flushed at position (1302036825548, 100)] We compact and get Foo-g-47-Metadata.db: [flushed at position (1302036825548, 100)] If we die and restart here we will correctly start reply of Foo at position 100 in this segment. (we can combine to a single flushed-at entry in this case since they were from the same CL segment. if they were from different segments we would keep both.)
        Hide
        Sylvain Lebresne added a comment -

        Actually I don't think this really solves the race condition. We really shouldn't compact the newly flushed sstable until we marked the commit log, because if we compact it, even if we're able to detect that 'some parts' of a sstable will be replay during recover, there is nothing we can do about it.

        Show
        Sylvain Lebresne added a comment - Actually I don't think this really solves the race condition. We really shouldn't compact the newly flushed sstable until we marked the commit log, because if we compact it, even if we're able to detect that 'some parts' of a sstable will be replay during recover, there is nothing we can do about it.
        Hide
        Jonathan Ellis added a comment -

        A separate component is a better idea.

        Show
        Jonathan Ellis added a comment - A separate component is a better idea.
        Hide
        Sylvain Lebresne added a comment -

        What about a new component .metadata for each sstable instead of a footer. I actually think we will have a use for other sstable metadata at some point anyway. For instance we could keep the file format version. That way we wouldn't rely so much on the data file name.

        Show
        Sylvain Lebresne added a comment - What about a new component .metadata for each sstable instead of a footer. I actually think we will have a use for other sstable metadata at some point anyway. For instance we could keep the file format version. That way we wouldn't rely so much on the data file name.
        Hide
        Jonathan Ellis added a comment -

        Hmm, I think we need both the CL header and this information, since this flush footer would only give us when we flushed which is not the same as "do I need to replay."

        For instance: if there is no flush marker for a commitlog segment in any existing sstable, that does not necessarily mean no data is in the commitlog for that CF.

        So replay position would be max(dirty at from CL header, flushed at from sstable footers).

        (You would need to allow multiple flush contexts in a single sstable footer, to preserve them during compaction.)

        Show
        Jonathan Ellis added a comment - Hmm, I think we need both the CL header and this information, since this flush footer would only give us when we flushed which is not the same as "do I need to replay." For instance: if there is no flush marker for a commitlog segment in any existing sstable, that does not necessarily mean no data is in the commitlog for that CF. So replay position would be max(dirty at from CL header, flushed at from sstable footers). (You would need to allow multiple flush contexts in a single sstable footer, to preserve them during compaction.)
        Hide
        Jonathan Ellis added a comment -

        What if instead of the CL "header" we record the CL context as part of an sstable footer? (footer is less likely to cause bugs w/ sstable math that assumes 0 = start of first row.) then there is no race.

        Show
        Jonathan Ellis added a comment - What if instead of the CL "header" we record the CL context as part of an sstable footer? (footer is less likely to cause bugs w/ sstable math that assumes 0 = start of first row.) then there is no race.
        Hide
        Sylvain Lebresne added a comment -

        One solution I see to this problem would be to record along with the replay position the time when we last updated this replay position. The during recover, we would first look at all the sstables (for the CF) and if a sstable is freshly flushed (which implies that we have a marker to know that a sstable was never compacted) and have a modification time higher that the last time we updated the replay position, then we'll just remove the sstable since we know it will be fully replayed.

        Note that to work correctly we also need a way to mark a freshly flushed sstable as 'non compactable' during the time it takes to mark the commit log.

        We would probably only do this for counter CF just to be on the safe side.

        Opinions ?

        Show
        Sylvain Lebresne added a comment - One solution I see to this problem would be to record along with the replay position the time when we last updated this replay position. The during recover, we would first look at all the sstables (for the CF) and if a sstable is freshly flushed (which implies that we have a marker to know that a sstable was never compacted) and have a modification time higher that the last time we updated the replay position, then we'll just remove the sstable since we know it will be fully replayed. Note that to work correctly we also need a way to mark a freshly flushed sstable as 'non compactable' during the time it takes to mark the commit log. We would probably only do this for counter CF just to be on the safe side. Opinions ?

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Sylvain Lebresne
            Reviewer:
            Jonathan Ellis
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 8h
              8h
              Remaining:
              Remaining Estimate - 8h
              8h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development