Cassandra
  1. Cassandra
  2. CASSANDRA-223

time-based slicing does not work correctly w/ "historical" memtables

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 0.4
    • Component/s: Core
    • Labels:
      None

      Description

      TimeFilter assumes that it is done as soon as it finds a column stamped earlier than what it is filtering on, but when you have a group of "historical" memtables whose columns were written in an arbitrary order this is not a safe assumption.

      It is not even a safe assumption when dealing with a single memtable + sstable pair, as the attached new test shows.

      1. 223.patch
        1 kB
        Jonathan Ellis
      2. 223-2.patch
        17 kB
        Jonathan Ellis
      3. 223-2-v2.patch
        16 kB
        Jonathan Ellis

        Activity

        Hide
        Jonathan Ellis added a comment -

        committed

        Show
        Jonathan Ellis added a comment - committed
        Hide
        Jun Rao added a comment -

        The patch looks good to me.

        Show
        Jun Rao added a comment - The patch looks good to me.
        Hide
        Jonathan Ellis added a comment -

        oops. already had to rebase.

        Show
        Jonathan Ellis added a comment - oops. already had to rebase.
        Hide
        Jonathan Ellis added a comment -

        add brute-force fix for the bug: always merge in results from all memtables and sstables. (this is what bigtable does.)

        Show
        Jonathan Ellis added a comment - add brute-force fix for the bug: always merge in results from all memtables and sstables. (this is what bigtable does.)
        Hide
        Jun Rao added a comment -

        Well, as for SSTable compaction, we probably can learn from Lucene. Overall, Lucene tries to keep on the order of log index segments, where n is the total number of segments generated. It does that by keeping merging index segments (up to a merging factor) of about the same size. The current compaction code in cassandra seems to try to do that too. It's worth revisiting it though.

        Show
        Jun Rao added a comment - Well, as for SSTable compaction, we probably can learn from Lucene. Overall, Lucene tries to keep on the order of log index segments, where n is the total number of segments generated. It does that by keeping merging index segments (up to a merging factor) of about the same size. The current compaction code in cassandra seems to try to do that too. It's worth revisiting it though.
        Hide
        Jonathan Ellis added a comment -

        I came to the same conclusion.

        One partial answer to the files-to-read is to change compaction to guarantee log sstable files instead of the current ad-hoc behavior, where n is the maximum sstable "generation" number. (Where "generation" is the number of compactions done.)

        For each CF, when you flush, you compact until there is nothing already at the same generation to compact with. For example,

        flush 1: nothing to merge. memtable becomes sstable-gen0
        flush 2: there is already a sstable-gen0 so you merge. now you have sstable-gen1
        flush 3: no gen0, so you store there. now you have sstable-gen0, sstable-gen1
        flush 4: 0 and 1 exist, so you compact (with the new one) to sstable-gen2

        etc.

        Generation tracking can be done in the sstable filename.

        Show
        Jonathan Ellis added a comment - I came to the same conclusion. One partial answer to the files-to-read is to change compaction to guarantee log sstable files instead of the current ad-hoc behavior, where n is the maximum sstable "generation" number. (Where "generation" is the number of compactions done.) For each CF, when you flush, you compact until there is nothing already at the same generation to compact with. For example, flush 1: nothing to merge. memtable becomes sstable-gen0 flush 2: there is already a sstable-gen0 so you merge. now you have sstable-gen1 flush 3: no gen0, so you store there. now you have sstable-gen0, sstable-gen1 flush 4: 0 and 1 exist, so you compact (with the new one) to sstable-gen2 etc. Generation tracking can be done in the sstable filename.
        Hide
        Jun Rao added a comment -

        Since it's hard to guarantee any timestamp ordering among memtable and sstables because of the reasons listed above, the only way that we can get the correct answer is do look at the memtable and ALL sstables (this is what HBase has been doing). This affects the following APIs.
        get_column
        get_slice_by_names
        get_slice_by_name_range
        get_slice_since

        The implication is that those APIs will potentially run slower since there are more files to read. One can probably tune the performance by setting a proper compaction policy.

        Show
        Jun Rao added a comment - Since it's hard to guarantee any timestamp ordering among memtable and sstables because of the reasons listed above, the only way that we can get the correct answer is do look at the memtable and ALL sstables (this is what HBase has been doing). This affects the following APIs. get_column get_slice_by_names get_slice_by_name_range get_slice_since The implication is that those APIs will potentially run slower since there are more files to read. One can probably tune the performance by setting a proper compaction policy.
        Hide
        Jun Rao added a comment -

        get_column is also designed to stop iterating SSTables as soon as the requested column is found. Therefore, it could suffer from the above problems too.

        Show
        Jun Rao added a comment - get_column is also designed to stop iterating SSTables as soon as the requested column is found. Therefore, it could suffer from the above problems too.
        Hide
        Jun Rao added a comment -

        The assumption that columns in an SSTable with a larger file name always have more recent timestamp also breaks in the following cases:
        1. after SSTables are compacted (since the compaction works on a group of SSTables at a time)
        2. hinted data is delivered
        3. columns fixed through read repairs

        Show
        Jun Rao added a comment - The assumption that columns in an SSTable with a larger file name always have more recent timestamp also breaks in the following cases: 1. after SSTables are compacted (since the compaction works on a group of SSTables at a time) 2. hinted data is delivered 3. columns fixed through read repairs
        Hide
        Jonathan Ellis added a comment -

        Thinking about it more, the current behavior is sort of reasonable if you assume that timestamp values are strictly increasing. I prefer not to rely on this because it's impossible to sync clocks with perfect accuracy, but it's a reasonable optimization to make and consistent with the rest of Cassandra's design.

        In this case though, clock sync problems can lead to permanently inconsistent behavior – different queries will not agree on what data the node contains.

        Show
        Jonathan Ellis added a comment - Thinking about it more, the current behavior is sort of reasonable if you assume that timestamp values are strictly increasing. I prefer not to rely on this because it's impossible to sync clocks with perfect accuracy, but it's a reasonable optimization to make and consistent with the rest of Cassandra's design. In this case though, clock sync problems can lead to permanently inconsistent behavior – different queries will not agree on what data the node contains.

          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Jonathan Ellis
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development