Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: Core
    • Labels: None

      Description

      After seeing the I/O issues in CASSANDRA-1470, I've been doing some more thinking on this subject that I wanted to lay out.

      I propose we redo the concept of how compaction works in Cassandra. At the moment, compaction is kicked off based on the write access pattern, not the read access pattern. In most cases, you want the opposite: you want to be able to track how well each SSTable is performing in the system. If we kept in-memory statistics for each SSTable and prioritized the tables by access frequency and bloom filter hit/miss ratio, we could intelligently group the sstables that are read most often and schedule them for compaction. We could also schedule lower-priority maintenance on SSTables that are not often accessed.

      I also propose we limit each SSTable to a fixed size, which would let us use our bloom filters in a predictable manner; beyond a certain size, bloom filters become less reliable. Fixed sizes would also allow us to group the most-accessed data. Currently an SSTable can grow to a point where large portions of its data might not actually be accessed very often.
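
      To make the statistics idea concrete, here is a minimal sketch of what per-SSTable read tracking could look like. SSTableStats and its method names are hypothetical illustrations, not existing Cassandra API:

          import java.util.concurrent.atomic.AtomicLong;

          // Hypothetical sketch only; not Cassandra's actual API.
          class SSTableStats
          {
              private final AtomicLong reads = new AtomicLong();               // reads routed to this sstable
              private final AtomicLong bloomFalsePositives = new AtomicLong(); // filter said "maybe", key was absent

              void recordRead(boolean bloomFalsePositive)
              {
                  reads.incrementAndGet();
                  if (bloomFalsePositive)
                      bloomFalsePositives.incrementAndGet();
              }

              long reads() { return reads.get(); }

              double bloomFalsePositiveRatio()
              {
                  long r = reads.get();
                  return r == 0 ? 0.0 : (double) bloomFalsePositives.get() / r;
              }
          }

      A scheduler could then sort sstables by reads() and break ties on bloomFalsePositiveRatio(), compacting the hottest, noisiest sstables first and deferring rarely-read ones to low-priority maintenance.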

      Attachments

      1. 1608-22082011.txt (67 kB), Benjamin Coverston
      2. 1608-v2.txt (37 kB), Jonathan Ellis
      3. 1608-v4.txt (62 kB), Jonathan Ellis
      4. 1608-v5.txt (77 kB), Benjamin Coverston

        Issue Links

          Activity

          Benjamin Coverston added a comment - edited

          Additional note: test suite runs about 20% slower for me w/ Leveled compactions. Unsure if that should be expected.

          That's not entirely unexpected. It's probably due in part to the amount of flushing that we force during the tests. Flushes and compactions both trigger interval tree builds.

          Other than that the codepaths are the same.

          Jonathan Ellis added a comment

          Also created CASSANDRA-3086.

          Hudson added a comment

          Integrated in Cassandra #1050 (See https://builds.apache.org/job/Cassandra/1050/)
          add LeveledCompactionStrategy (take 2)
          patch by Ben Coverston; reviewed by jbellis for CASSANDRA-1608
          add LeveledCompactionStrategy
          patch by Ben Coverston; reviewed by jbellis for CASSANDRA-1608

          jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1162223
          Files :

          • /cassandra/trunk/CHANGES.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/config/CFMetaData.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/CollationController.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStoreMBean.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/DataTracker.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/RowIteratorFactory.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/AbstractCompactionStrategy.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/CompactionTask.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/LeveledCompactionStrategy.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/LeveledCompactionTask.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTable.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableReader.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableWriter.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/INotification.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/INotificationConsumer.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/SSTableAddedNotification.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/SSTableListChangedNotification.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/IntervalTree/Interval.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/IntervalTree/IntervalNode.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/IntervalTree/IntervalTree.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/utils/IntervalTest.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/utils/IntervalTreeTest.java

          jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1162217
          Files :

          • /cassandra/trunk/CHANGES.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/config/CFMetaData.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/CollationController.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStoreMBean.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/DataTracker.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/RowIteratorFactory.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/AbstractCompactionStrategy.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/CompactionTask.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/LeveledCompactionStrategy.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/LeveledCompactionTask.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTable.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableReader.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableWriter.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/INotification.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/INotificationConsumer.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/SSTableAddedNotification.java
          • /cassandra/trunk/src/java/org/apache/cassandra/notifications/SSTableListChangedNotification.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/IntervalTree
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/IntervalTree/Interval.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/IntervalTree/IntervalNode.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/IntervalTree/IntervalTree.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/utils/IntervalTest.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/utils/IntervalTreeTest.java
          Jonathan Ellis added a comment - edited

          ... didn't quite get the DT registration right the first time, but it's committed again now.

          Still seeing a few failures when I make Leveled the default for the test suite. Most are bogus (e.g. a test assuming it can request compaction of arbitrary sstables) but some are not:

          [junit] Testsuite: org.apache.cassandra.db.CleanupTest
          [junit] Invalid memory access of location 0x0 rip=0x0
          
          [junit] Test org.apache.cassandra.db.ColumnFamilyStoreTest FAILED (crashed)
          [junit] Testsuite: org.apache.cassandra.db.ColumnFamilyTest
          

          Actually, it's quite likely that these are manifestations of the CASSANDRA-3085 problem, which is more likely to bite us in practice now that there are many more sstables generated. So I'll address that first before worrying about these.

          Additional note: test suite runs about 20% slower for me w/ Leveled compactions. Unsure if that should be expected.

          Jonathan Ellis added a comment

          Committed with some changes:

          • Switched DataTracker registration to be eager instead of lazy – otherwise sstables created before any compaction happened (during log replay for instance) would not be added to the manifest. Also added unregistration when a new Strategy is created on schema change.
          • added some debug logging to Manifest
          • moved the LevelDB classes from db.leveldb to db.compaction, so that I could add an "unqualified compaction strategy will be looked for in oac.db.compaction" rule
          • renamed LevelDB* to Leveled*
          • ran dos2unix on the intervaltree classes and reformatted
          • added code to make it not attempt to compact empty L0, which caused an assertion error
          • Added getUnleveledSSTables to CFSMBean

          Still to do: CASSANDRA-3085

          Benjamin Coverston added a comment

          Fixed up the patch according to the comments given. Took a stab at culling some of the SSTables from the locking mechanism.

          Benjamin Coverston added a comment - edited

          I've made the changes requested in the last two comments. The latest changes/merge seem to have caused a regression when the # of SSTables increases beyond a few hundred. The next time I'll be able to look at this is Friday; I'll try to figure out what on earth is going on.

          EDIT:
          Somehow I screwed up the attached patch. I'll fix it and resubmit.

          Jonathan Ellis added a comment

          I assumed that as I was doing operations on these SSTables in the referenced views I would also need to use these referenced.

          Right, but isn't that already the case pre-1608?

          I'm just trying to understand if this fixes an existing bug, or if I'm missing how the usage changed.

          Yes this is a potentially serious issue. This code gets called on every read.

          Feels like DataTracker is the wrong granularity to be doing reference counting at. I'd like to see if we can push that into SSTableReader instead. But let's do that in another ticket, for our purposes here correctness will suffice and we can optimize later.

          Benjamin Coverston added a comment

          What is the 1.25 supposed to be doing here?

          Dunno what I was thinking; I was screwing around with giving the promoted range a size, but it looks like that ended up in the wrong place.

          Why the "all on the same level" special case? Is this just saying "L0 compactions must go into L1?"

          Yes, also when a compaction gets triggered into an empty target level the same logic applies.

          removed this. if L0 is large, it doesn't necessarily follow that L1 is large too. I don't see a good reason to second-guess the scoring here.

          Actually this was there to prevent an OOM exception when too many SSTables were participating in any given compaction. You are, however, correct that it doesn't follow that L1 is large in all cases. I'll revise this to give an upper bound to the list of L0 candidates in a given compaction.

          L0 only gets two sstables before it's overcapacity? Are we still allowing L0 sstables to be large? if so it's not even two

          I was screwing around with this threshold. One of the side effects of the dynamic flush thresholds was that I could end up with a substantial number of small SSTables "stuck" in L0. One way to fix this is to always give L0 a small positive score when there are any SSTables in L0, so that the SSTables get cleared out once the rest of the leveling has been done. Previously I was using the memtable flush threshold as the multiplier for L0, but with dynamic flushing and global memtable thresholds this doesn't mean much anymore. I'm inclined to leave it and perhaps raise the multiplier for L0 from 2 to 4.

          .bq "Exposing number of SSTables in L0 as a JMX property probably isn't a bad idea."

          I'll get this in

          it's not correct for the create/load code to assume that the first data directory stays constant across restarts – it should check all directories when loading

          I'll fix this

          CFS
          ===

          • not immediately clear to me if the TODOs in isKeyInRemainingSSTables are something i should be concerned about

          I cleaned this up

          • why do we need the reference mark/unmark now but not before? is this a bug fix independent of 1608?

          Use reference counting to delete sstables instead of relying on the GC. Patch by slebresne; reviewed by jbellis for CASSANDRA-2521.

          I assumed that as I was doing operations on these SSTables in the referenced views I would also need to use these referenced.

          • are we losing a lot of cycles to markCurrentViewReferenced on the read path now that this is 1000s of sstables instead of 10s?

          Yes, this is a potentially serious issue: this code gets called on every read, which is a pretty heavy price to pay.

          Jonathan Ellis added a comment

          v4 attached.

          Manifest
          ========

          • I noticed that Manifest.generations and lastCompactedKeys could be simplified to arrays if we are willing to assume that no node will have more than a PB or so of data in a single CF. Which feels reasonable to me even with capacity expanding as fast as it is.
          • What is the 1.25 supposed to be doing here?
                    // skip newlevel if the resulting sstables exceed newlevel threshold
                    if (maxBytesForLevel(newLevel) < SSTableReader.getTotalBytes(added)
                        && SSTableReader.getTotalBytes(getLevel(newLevel + 1)) == 0 * 1.25)
            
          • Why the "all on the same level" special case? Is this just saying "L0 compactions must go into L1?"
                    // the level for the added sstables is the max of the removed ones,
                    // plus one if the removed were all on the same level
            
          • removed this. if L0 is large, it doesn't necessarily follow that L1 is large too. I don't see a good reason to second-guess the scoring here.
                        if (candidates.size() > 32 && bestLevel == 0)
                        {
                            candidates = getCandidatesFor(1);
                        }
            
          • redid L0 candidate selection to follow the LevelDB algorithm (pick one L0, add other L0s and L1s that overlap). This means that if we're doing sequential writes we don't do "extra" work compacting non-overlapping L0s unnecessarily. (A niche use to be sure given our emphasis on RP but it's not a lot of code.)
          • L0 only gets two sstables before it's overcapacity? Are we still allowing L0 sstables to be large? if so it's not even two
          • "Exposing number of SSTables in L0 as a JMX property probably isn't a bad idea."
          • it's not correct for the create/load code to assume that the first data directory stays constant across restarts – it should check all directories when loading

          CFS
          ===

          • not immediately clear to me if the TODOs in isKeyInRemainingSSTables are something i should be concerned about
          • why do we need the reference mark/unmark now but not before? is this a bug fix independent of 1608?
          • are we losing a lot of cycles to markCurrentViewReferenced on the read path now that this is 1000s of sstables instead of 10s?

          DataTracker
          ===========

          • followed todo's suggestion to move incrementallyBackup to another thread
          • why do we use a LinkedList in buildIntervalTree when we know the size beforehand?
          • suspect that it's going to be faster to use interval tree to prune the search space for CollationController.collectTimeOrderedData, then sort that subset by timestamp. Which would simplify DataTracker by not having to keep a list of sstables around sorted-by-timestamp – could get rid of that entirely in favor of the tree, I think.

          Compaction
          ==========

          • Did this code get moved somewhere else so manual compaction request against a single sstable remains a no-op for SizeTiered?
                        if (toCompact.size() < 2)
                        {
                            logger.info("Nothing to compact in " + cfs.getColumnFamilyName() + "." +
                                        "Use forceUserDefinedCompaction if you wish to force compaction of single sstables " +
                                        "(e.g. for tombstone collection)");
                            return 0;
                        }
            
          Benjamin Coverston added a comment

          Added synchronization to add().

          Alan Liang added a comment

          From a high level, it's looking good.

          In Manifest.java, either "public void add(SSTableReader reader)" should be synchronized, or generations should be held in a NonBlockingHashMap, because multiple threads could be calling this.
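
          For illustration, a minimal sketch of the two options, assuming (hypothetically) that generations maps a level number to its list of readers; neither snippet is Manifest's real code:

              // Option 1: serialize callers on the manifest itself.
              public synchronized void add(SSTableReader reader)
              {
                  generations.get(0).add(reader); // new sstables always enter L0
              }

              // Option 2: leave add() unsynchronized and make the shared map safe for
              // concurrent callers with high-scale-lib's NonBlockingHashMap (the
              // per-level lists would then need to be concurrent-safe as well):
              // private final Map<Integer, List<SSTableReader>> generations =
              //     new NonBlockingHashMap<Integer, List<SSTableReader>>();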

          Benjamin Coverston added a comment

          Rebased and updated with some fixes. All tests should now pass.

          Benjamin Coverston added a comment

          It should. I'll have an updated patch today.

          Jeremy Hanna added a comment

          Is this going to make it into the 1.0 release? Seems like it's awfully close.

          Benjamin Coverston added a comment

          Thanks Alan, you're right. I'm still working on this. I appreciate that you're looking at it.

          Ben

          Alan Liang added a comment - edited

          There's a problem with Interval#intersects:

          public boolean intersects(Interval interval)
          {
              return this.contains(interval.min) || this.contains(interval.min);
          }
          

          I think you wanted:

              return this.contains(interval.min) || this.contains(interval.max);
          

          However, a more efficient way to do this would be:

              return this.min.compareTo(interval.max) <= 0 && this.max.compareTo(interval.min) >= 0;
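
          Worth noting: the endpoint form still misses one case. When the argument strictly contains this interval, neither of the argument's endpoints falls inside this, yet the two intersect; the comparison form handles that case too. A self-contained illustration, with a hypothetical integer interval standing in for the generic Interval:

              class IntInterval
              {
                  final int min, max;
                  IntInterval(int min, int max) { this.min = min; this.max = max; }

                  boolean contains(int point) { return min <= point && point <= max; }

                  // endpoint form: misses full containment of this by o
                  boolean intersectsByEndpoints(IntInterval o) { return contains(o.min) || contains(o.max); }

                  // comparison form: handles every case
                  boolean intersectsByComparison(IntInterval o) { return min <= o.max && max >= o.min; }
              }

              // [5,6] lies strictly inside [1,10], so neither 1 nor 10 is contained in [5,6]:
              // new IntInterval(5, 6).intersectsByEndpoints(new IntInterval(1, 10))  -> false (wrong)
              // new IntInterval(5, 6).intersectsByComparison(new IntInterval(1, 10)) -> true  (correct)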
          
          Benjamin Coverston added a comment

          1608 without some of the cruft

          Benjamin Coverston added a comment - edited

          We know the input and output keys, yes.

          If we isolate the problem to concurrent compactions in the same level, and staggered levels {L2, L3}, {L4, L5}, it is certainly an easier problem.

          Jonathan Ellis added a comment

          Where I was going is if we are compacting {L2.1, L3.1, L3.2, ..., L3.11}, we can also compact {L2.9, L3.90, L3.91, ..., L3.99}, for instance. Because if the input keys are non-overlapping, we know that the output keys will be as well. Right?

          Benjamin Coverston added a comment

          Could we allow concurrency between non-overlapping compactions? I assume that's what makes things tricky.

          TL;DR: Yes, but we would have to disable the ability of the DataTracker to evict some sstables from a currently scheduled compaction, and instead have it remove all of them, effectively canceling the scheduled compaction.

          A compaction from L0 to L1 causes every SSTable in L1 to be turned over. Because of that I can't schedule concurrent compactions from {L0, L1} and {L1, L2} without running the risk of the source SSTable in the {L1, L2} compaction being evicted by the data tracker.

          After L1 it gets a bit easier. If I can force compactions to complete and enter a serial mode for {L0, L1}, or at least tie L0, L1, and L2 together, then concurrently compacting higher levels is a little easier and falls under a scenario of exclusion.

          L2 - 1...10
          L3 - 1...100

          We start by choosing L2.1 and find that it overlaps L3.1-11, so we start a compaction with 12 SSTables. At this point we don't know what the state of the levels will look like after the compaction is complete, but we can guess.

          If we think that after removing max_sstable_size from L2 we'll still need more compactions, we can move on to scheduling a new compaction using L2.2; but L2.2 may still overlap one of the sstables in L3.1-11, so we would keep choosing candidates from that level until we find one that doesn't overlap the sstables currently compacting into the next level.
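
          A sketch of that candidate search; Range and the method names below are stand-ins for illustration, not the patch's actual code:

              import java.util.List;

              class Range
              {
                  final long min, max; // an sstable's [firstKey, lastKey] span, simplified to longs
                  Range(long min, long max) { this.min = min; this.max = max; }
                  boolean overlaps(Range o) { return min <= o.max && max >= o.min; }
              }

              class CandidateSelection
              {
                  // Walk the level in order; the first sstable whose span clears every
                  // in-flight compaction could be scheduled concurrently.
                  static Range nextCandidate(List<Range> level, List<Range> compacting)
                  {
                      outer:
                      for (Range candidate : level)
                      {
                          for (Range busy : compacting)
                              if (candidate.overlaps(busy))
                                  continue outer;
                          return candidate;
                      }
                      return null; // everything overlaps; wait for running compactions to finish
                  }
              }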

          Jonathan Ellis added a comment

          you should know that the synchronization code for this basically disables concurrent compactions

          Hmm, I didn't see that coming. Could we allow concurrency between non-overlapping compactions? I assume that's what makes things tricky.

          Benjamin Coverston added a comment

          It's a very simplistic trigger. You'll find it in the manifest code:

          In Manifest.Promote:

              int newLevel = minimumLevel == maximumLevel ? maximumLevel + 1 : maximumLevel;
              newLevel = skipLevels(newLevel, added);

          This fixes the case where you have one large SSTable and you want to migrate that SSTable to the new format. After a compaction it will find an empty level for the new, non-overlapping set to fall into. Best case scenario for someone migrating a lot of data with this code would be to do a major compaction (tiered) followed by changing the compaction strategy to leveled.

          Also, you should know that the synchronization code for this basically disables concurrent compactions. I'm not sure if that's going to be an issue. To re-enable concurrency in a sane manner I would need to introduce quite a bit of complexity into the system. It's my hope that the work we're doing to improve the concurrency of single compactions combined with the shorter duration of each compaction in leveldb will make this a non-issue.

          Jonathan Ellis added a comment

          Where should I look for that? What triggers level-skipping?

          Benjamin Coverston added a comment

          Added level skipping logic.

          Ryan King added a comment

          Is it even worth keeping bloom filters around with such a drastic reduction in worst-case number of sstables to check (for read path too)?

          I think they are absolutely worth keeping around for unleveled sstables, but for leveled sstables the value is certainly questionable. Perhaps having some kind of LRU cache where we have an upper bound on the number of bloom filters we keep in memory would be wise. Is it possible that we could move these off-heap?

          I admit that I probably don't fully understand this change, but we have at least one workload where keeping BFs would probably be necessary: the vast majority of the traffic on that workload is for keys that don't exist anywhere. Even small bumps in BF false positive rates greatly affect the read performance.

          Benjamin Coverston added a comment

          Not a deal breaker for me – it's not hard to get old-style compactions to back up under sustained writes, either. Given a choice between "block writes until compactions catch up" or "let them back up and let the operator deal with it how he will," I'll take the latter.

          Exposing number of SSTables in L0 as a JMX property probably isn't a bad idea.

          Is it even worth keeping bloom filters around with such a drastic reduction in worst-case number of sstables to check (for read path too)?

          I think they are absolutely worth keeping around for unleveled sstables, but for leveled sstables the value is certainly questionable. Perhaps having some kind of LRU cache where we have an upper bound on the number of bloom filters we keep in memory would be wise. Is it possible that we could move these off-heap?
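
          For what such an LRU bound could look like, a minimal sketch using Java's access-ordered LinkedHashMap; BloomFilter is a placeholder type and the capacity is illustrative:

              import java.util.LinkedHashMap;
              import java.util.Map;

              interface BloomFilter { boolean isPresent(byte[] key); } // placeholder type

              // Access-ordered map: a lookup moves the entry to the tail, and the
              // eldest entry is evicted once the bound is exceeded. An evicted filter
              // would be reloaded (or reads fall through to the index) on demand.
              class BloomFilterCache<K> extends LinkedHashMap<K, BloomFilter>
              {
                  private static final int MAX_FILTERS = 1024; // illustrative bound

                  BloomFilterCache()
                  {
                      super(16, 0.75f, true); // accessOrder = true
                  }

                  @Override
                  protected boolean removeEldestEntry(Map.Entry<K, BloomFilter> eldest)
                  {
                      return size() > MAX_FILTERS;
                  }
              }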

          I'd like to have a better understanding of what the tradeoff is between making these settings larger/smaller. Can we make these one-size-fits-all?

          Some pros and cons here. The biggest con is that with a 64MB flushed sstable and a 25MB leveled size, leveling that file will require us to run a compaction over approximately 314MB of data (25MB * 10 + 64MB) to get the data leveled into L1. If we choose 50MB for our leveled size the math is the same, but we end up compacting 564MB of data. Taking into account level-based scoring (to choose the next compaction candidates), these settings become somewhat dynamic, and the interplay between flush size and sstable size is anything but subtle. A small leveled size in combination with a large flushing memtable means that each time you merge a flushed SSTable into L1 you could end up with many cycles of cascading compactions into L2, and potentially into L3 and higher, until the scores for L1, L2, and L3 normalize into a range that again triggers compactions from L0 to L1.

          I wanted to keep the time for each compaction to something < 10 seconds so I chose an sstable size in the range of 5-10 MB and that was effective.

          I like the idea of having a one-size-fits-all setting for this, but whatever I choose I think that compaction is going to force me to revisit it. Right now this setting is part of the schema, and it's a nested schema setting at that. I'm leaning toward "undocumented setting" right now with a reasonable default.

          Jonathan Ellis added a comment

          The interval tree does a good job here making sure that bloom filters are queried only for those SSTables that fall into the queried range

          Is it even worth keeping bloom filters around with such a drastic reduction in worst-case number of sstables to check (for read path too)?

          Compactions do back up

          Not a deal breaker for me – it's not hard to get old-style compactions to back up under sustained writes, either. Given a choice between "block writes until compactions catch up" or "let them back up and let the operator deal with it how he will," I'll take the latter.

          flush size to 64MB and the leveled SSTable size to anywhere between 5-10MB

          I'd like to have a better understanding of what the tradeoff is between making these settings larger/smaller. Can we make these one-size-fits-all?

          For datasets that frequently overwrite old data that has already been flushed to disk there is the potential for substantial de-duplication of data

          Yes, this is a big win. Even people who will never fill up half their disk complain about the worst-case major compaction scenario for old-style compaction.

          Benjamin Coverston added a comment

          Updated s.t. manifests are now in the data directory. Rebased.

          Benjamin Coverston added a comment

          First the good:

          1. Modified the code s.t. tombstone purge during minor compactions uses the interval tree to prune the list of SSTables, speeding up compactions by at least an order of magnitude where the number of SSTables in a column family exceeds ~500.

          2. Tested reads and writes. Write speeds (unsurprisingly) are not affected by this compaction strategy. Reads seem to keep up as well. The interval tree does a good job here making sure that bloom filters are queried only for those SSTables that fall into the queried range.

          3. Three successive runs of stress inserting 10M keys resulted in ~3GB of data stored in leveldb. By comparison, the same run using the tiered (default) strategy resulted in ~8GB of data.

          The Meh:

          Compactions do back up when setting the flush size to 64MB and the leveled SSTable size to anywhere between 5-10MB. On the upside, if your load has peaks and quieter times this compaction strategy will trigger a periodic check to "catch up" if all event-scheduled compactions complete.

          Interestingly this extra IO has an upside. For datasets that frequently overwrite old data that has already been flushed to disk there is the potential for substantial de-duplication of data. Further, during reads the number of rows that would need to be merged for a single row is bound by the number of levels + the number of un-leveled sstables.

          Benjamin Coverston added a comment

          Added an interval tree to cull sstables that are not needed for point and range queries.

          Benjamin Coverston added a comment - edited

          You're right, I'm thick. I just had to go down to the node itself; the Entry object gives me the interface I need.

          Jonathan Ellis added a comment

          Don't the NavigableMap methods (implemented by CSLM) give you the traversal api you'd need?

          Benjamin Coverston added a comment

          It won't, because you can't do a simple binary search for a range; it's really a problem of intersection rather than matching, and comparators alone don't solve the problem of "give me all the intersecting ranges for this set" without comparing every range for intersection.

          Nearly every interval intersection algorithm depends on tree traversal, and while many of the existing collections are based on binary or red-black trees, they don't expose the methods necessary for traversal. Only the comparator used to build the tree is exposed, and the models only offer "iterate over everything" or "search for the thing I want".
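
          To make the traversal point concrete, here is a toy centered interval tree (a sketch, not the implementation that eventually landed in o.a.c.utils.IntervalTree): each node keeps the intervals that cross its center point, and a query descends into a subtree only when the query range can reach that side of the center, which is exactly the traversal a comparator-keyed map doesn't expose:

              import java.util.ArrayList;
              import java.util.List;

              // Toy centered interval tree over int endpoints; assumes a non-empty input list.
              class IntervalTreeSketch
              {
                  static class Iv { final int min, max; Iv(int min, int max) { this.min = min; this.max = max; } }

                  final int center;
                  final List<Iv> crossing = new ArrayList<Iv>(); // intervals containing center
                  IntervalTreeSketch left, right;                // entirely left / right of center

                  IntervalTreeSketch(List<Iv> ivs)
                  {
                      int lo = Integer.MAX_VALUE, hi = Integer.MIN_VALUE;
                      for (Iv iv : ivs) { lo = Math.min(lo, iv.min); hi = Math.max(hi, iv.max); }
                      center = lo + (hi - lo) / 2;
                      List<Iv> l = new ArrayList<Iv>(), r = new ArrayList<Iv>();
                      for (Iv iv : ivs)
                      {
                          if (iv.max < center) l.add(iv);
                          else if (iv.min > center) r.add(iv);
                          else crossing.add(iv);
                      }
                      if (!l.isEmpty()) left = new IntervalTreeSketch(l);
                      if (!r.isEmpty()) right = new IntervalTreeSketch(r);
                  }

                  // collect every stored interval intersecting [qmin, qmax]
                  void search(int qmin, int qmax, List<Iv> out)
                  {
                      for (Iv iv : crossing)
                          if (iv.min <= qmax && iv.max >= qmin)
                              out.add(iv);
                      if (left != null && qmin < center)  // query reaches left of center
                          left.search(qmin, qmax, out);
                      if (right != null && qmax > center) // query reaches right of center
                          right.search(qmin, qmax, out);
                  }
              }

          A query then costs roughly the number of matches plus the depth of the path, instead of one comparison per sstable.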

          Stu Hood added a comment

          The problem here is that there aren't any really good RBtree or even binary tree implementations that I have found in the dependencies that we currently have

          Would ConcurrentSkipListMap work, or do you need concurrent operations on multiple items?

          Benjamin Coverston added a comment

          Added a patch with fixed range filters.

With ~1100 sstables, average latency is substantially increased (~5-10x). I'm pretty sure that in order to improve on this we'll need to implement an interval tree to get a non-linear search time for overlapping sstables in interval queries.

          The problem here is that there aren't any really good RBtree or even binary tree implementations that I have found in the dependencies that we currently have, and I really don't want to muddy this ticket up with that effort.

          There are some potentially useful structures in UIMA that I can use to base the implementation of an interval tree off of, but right now I'm leaning toward doing this in a separate ticket.

          Benjamin Coverston added a comment -

          Added range filters to reads.

          Benjamin Coverston added a comment -

          Fixed exception on startup.

          Benjamin Coverston added a comment - - edited

          Attaching the latest version.

          Levels are scored according to size rather than number of SSTables.

          Performance kind of sucked until I made a few tweaks.

          You can modify the LevelDB SSTable size by the following command (example) in the CLI:

update column family Standard1 with compaction_strategy_options=[{ sstable_size_in_mb : 10 }];

          Using a flush size of 64MB and an sstable_size_in_mb of 5 worked pretty well for keeping compactions moving through the levels and handling new SSTables as they entered the system.

          I also enabled concurrent compactions which, to my surprise, helped considerably. In testing I also removed compaction throttling, but in the end I don't think it mattered too much for me.

          This version also adds a manifest and recovery code to the mix. While running you can cat the manifest, it's human readable, and quite beautiful to see the levels interact with each other as SSTables are flushed and compactions roll through the levels.

Right now I'm getting an exception on startup from the keycache; I'm going to investigate that, but I think it may have to do with the fact that I am initializing the compaction manager after the CFS.

          Jonathan Ellis added a comment -

          I checked what leveldb actually does: http://www.google.com/codesearch#mHLldehqYMA/trunk/db/version_set.cc, methods Finalize and PickCompaction.

          What it does is compute a score for each level, as the ratio of bytes in that level to desired bytes. For level 0, it computes files / desired files instead. (Apparently leveldb doesn't have row-level bloom filters, so merging on reads is extra painful.*) The level with the highest score is compacted.

          When compacting L0 the only special casing done by leveldb is that after picking the primary L0 file to compact, it will check other L0 files for overlapping-ness too. (Again, we can expect this to usually if not always be "all L0 files," but it's not much more code than a "always compact all L0 files" special case would be, so why not avoid some i/o if we can.)

*I'm pretty sure that (a) we don't need to special case for this reason and (b) we should standardize on bytes instead of file count; the latter is too subject to inaccuracy, both from streamed files as mentioned and, on later levels, from the fact that compaction results are not going to be clean – if we merge one sstable of size S from L with two of size S from L+1, odds are poor we'll end up with merged bytes divisible by S or even very close to it. The overwhelming likelihood is you end up with two of size S and one of size 0 < size < S. Do enough of these and using sstable count as an approximation for size gets pretty inaccurate. Fortunately a method to sum SSTableReader.length() would be easy enough to write instead.
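
A rough sketch of scoring by bytes (illustrative only; the constants mirror the leveldb defaults quoted elsewhere in this ticket, and none of this is from an attached patch):

import java.util.List;

class LevelScoreSketch
{
    // L0 scores by file count, higher levels by bytes relative to the 10^L MB target.
    static double score(int level, List<Long> sstableLengths)
    {
        if (level == 0)
            return sstableLengths.size() / 4.0;  // leveldb's desired L0 file count is 4

        long bytes = 0;
        for (long length : sstableLengths)
            bytes += length;                     // e.g. summing SSTableReader.length()

        double desiredBytes = Math.pow(10, level) * 1024 * 1024; // 10MB for L1, 100MB for L2, ...
        return bytes / desiredBytes;             // compact the level with the highest score
    }
}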

          Jonathan Ellis added a comment -

> The LDBCompaction task was changed to limit the size of the SSTables that are output by the compaction itself.

          Ah, totally makes sense. Wonder if we can refactor some more to avoid so much duplicate code.

> I'll make some more modifications to the manifest s.t. there is a single path for getting new SSTables (flushed and streamed) into the manifest. I found a bug on the plane today where they were getting added to the manifest, but they weren't being added to the queue

          I think I fixed that by getting rid of the queue. It was basically just L0 anyway.

          I like "Manifest.add()" [to L0] being The Single Path, feels pretty foolproof to me.

> There are some boundary cases where every SSTable that gets compacted will be in the same level. Most of them have to do with L+1 being empty.

          Also makes sense.

> RE: the order, it does feel like we should do lower levels before higher levels, however one thing that we have to do is make sure that level-1 stays at 10 SSTables. The algorithm dictates that all of the level-0 candidates get compacted with all of the candidates at level-1.

          Well, all the overlapping ones. Which is usually going to be all of them, but it's easy enough to check that we might as well on the off chance that we get to save some i/o.

> This means that you need to promote out of level-1 so that it is ~10 SSTables before you schedule a compaction for level-0 promotion.

          I'm not sure that necessarily follows. Compacting lower levels first means less duplicate recompaction from L+1 later. L0 is particularly important since lots of sstables in L0 means (potentially) lots of merging by readers.

          In any case, the comments in gCC talked about prioritizing L1 but the code actually prioritized L0 so I went with that.

          Benjamin Coverston added a comment -

The LDBCompaction task was changed to limit the size of the SSTables that are output by the compaction itself. Once the size of the rows compacted exceeds the default size in MB, it creates a new SSTable:

          >>

// size cap reached, or input exhausted: close the current output sstable
if (position > cfs.metadata.getMemtableThroughputInMb() * 1024 * 1024
        || nni.hasNext() == false)
{
          <<

It feels like a bit of a hack because an optimal flush size may not always be an optimal storage size, but my goal was to try to keep the SSTable size in a reasonably small range to make compactions into level 1 fast.

          I'll make some more modifications to the manifest s.t. there is a single path for getting new SSTables (flushed and streamed) into the manifest. I found a bug on the plane today where they were getting added to the manifest, but they weren't being added to the queue that I was adding flushed SSTables to. I'll get that into my next revision.

          >>
          In promote, do we need to check for all the removed ones being on the same level? I can't think of a scenario where we're not merging from multiple levels. If so I'd change that to an assert. (In fact there should be exactly two levels involved, right?)
          <<

          I considered this. There are some boundary cases where every SSTable that gets compacted will be in the same level. Most of them have to do with L+1 being empty. Also sending the SSTables through the same compaction path will evict expired tombstones before they end up in the next level where compactions become increasingly unlikely.

          >>
Did some surgery on getCompactionCandidates. Generally renamed things to be more succinct. Feels like getCompactionCandidates should do lower levels before doing higher levels?
          <<

Let's just say my naming conventions have been shaped by different influences; I wouldn't object to any of the new names you chose, however.

RE: the order, it does feel like we should do lower levels before higher levels, however one thing that we have to do is make sure that level-1 stays at 10 SSTables. The algorithm dictates that all of the level-0 candidates get compacted with all of the candidates at level-1. This means that you need to promote out of level-1 so that it is ~10 SSTables before you schedule a compaction for level-0 promotion. Right now tuning this so that it is performant is the biggest hurdle; I have made some improvements by watching the CompactionExecutor, but I have a feeling that making this work is going to require some subtle manipulation of the way that the CompactionExecutor handles tasks.

          Jonathan Ellis added a comment -

          Thanks, Ben. This is promising!

I pretty much concentrated on the Manifest, which I moved to a top-level class. (Can you summarize what is different in LDBCompactionTask?)

I don't think trying to build levels out of non-leveled data is useful. Even if you tried all permutations the odds of ending up with something useful are infinitesimally small. I'd suggest adding a startup hook instead to CompactionStrategy, and if we start up w/ unleveled SSTables we level them before doing anything else. (This will take a while, but not as long as leveling everything naively would, since we can just do a single compaction-of-everything, spitting out non-overlapping sstables of the desired size, and set those to the appropriate level.)

          Updated DataTracker to add streamed sstables to level 0. DataTracker public API probably needs a more thorough look though to see if we're missing anything. (Speaking of streaming, I think we do need to go by data size not sstable count b/c streamed sstables from repair can be arbitrarily large or small.)

          In promote, do we need to check for all the removed ones being on the same level? I can't think of a scenario where we're not merging from multiple levels. If so I'd change that to an assert. (In fact there should be exactly two levels involved, right?)

Did some surgery on getCompactionCandidates. Generally renamed things to be more succinct. Feels like getCompactionCandidates should do lower levels before doing higher levels?

          We'll also need to think about which parts of the strategy/manifest need to be threadsafe. (All of them?) Should definitely document this in AbstractCS.

          Benjamin Coverston added a comment -

Adding a patch for leveldb style compaction. I see this as a 'good start' and I'm looking for some further input. I'm not going to be able to work on this for the next week or so, so I'm putting it here to start some discussion on this approach.

          This implementation requires no durable manifest.

          Ranges are created at SSTable creation (flush or compaction) or sstable index creation.

          Exponent used for levels is 10.

Preliminary runs show that high write rates do make level 0 to level 1 promotions back up substantially, but once cleared, promotions out of level 1 seem to be very fast.

          I found the best performance by removing the compaction throughput throttling and setting concurrent compactors to 1.

          The SSTable size in this implementation is determined by the flush size in mb setting.

The recovery path reads the list of SSTables, groups them by non-overlapping ranges, then places each range in its appropriate level.
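
A minimal sketch of that grouping step (the Range type and all names here are hypothetical, not from the patch): sort sstables by first key, then greedily append each one to the first run whose last key ends before it, so every run is a non-overlapping sequence that can seed a level.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RecoverySketch
{
    static class Range { long first, last; } // hypothetical [first, last] key range of one sstable

    static List<List<Range>> groupNonOverlapping(List<Range> sstables)
    {
        sstables.sort(Comparator.comparingLong(r -> r.first));
        List<List<Range>> runs = new ArrayList<>();
        for (Range r : sstables)
        {
            List<Range> target = null;
            for (List<Range> run : runs)              // find a run this sstable doesn't overlap
                if (run.get(run.size() - 1).last < r.first)
                {
                    target = run;
                    break;
                }
            if (target == null)
                runs.add(target = new ArrayList<>()); // start a new non-overlapping run
            target.add(r);
        }
        return runs;                                  // each run can be placed in its own level
    }
}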

          Finally credit is due to the leveldb team as this design was inspired by the leveldb implementation.

          Jonathan Ellis added a comment -

> This is probably the best option, create a new component type: METADATA_STORE which will hold namespaced key/value pairs on a per-sstable basis

          We've been sticking metadata into the statistics component, fwiw. It's easier to leave the filename the same, but it's definitely not just statistics anymore.

          Benjamin Coverston added a comment - - edited

          There's probably nothing that prevents us from doing that. Is our goal here to replace compaction entirely?

The manifest information consists, minimally, of the level information and ranges. For us ranges are easy, as they are readily available when the SSTables are read in at restart, flushing, or compaction.

Taking a stab at this I made the compaction manager abstract, then created a concrete implementation for the current compaction implementation. Happily hacking on a level-based compaction manager I kept running into a dilemma: where do I store the level information? There are a few options:

1. The descriptor: a hack, simple, but it also adds information that probably wouldn't be used by any other compaction manager, yet it would be there. Unless we're moving headlong into a level-db approach I'm not super excited about this.

          2. Store it on a per-sstable basis in the sstable: To continue along this path I would like to have a standard place to put "extra" metadata in the sstables. A header of some sort. I like the idea of using a metadata block in the SSTables to store this type of information.

3. Use an on-disk manifest. – Pro: only my compaction manager needs to deal with this information, but there is a non-trivial amount of bookkeeping that would need to be done to ensure this is kept up to date and valid.

          EDIT:
          4. This is probably the best option, create a new component type: METADATA_STORE which will hold namespaced key/value pairs on a per-sstable basis.

          Jonathan Ellis added a comment -

          I don't see anything preventing us from recording that information on a per-sstable basis, the way we do flush point information now (CASSANDRA-2419).

          Benjamin Coverston added a comment -

          I like the LevelDB approach in almost every way except that recovery becomes more complicated than our current scenario. Right now we can basically ship files from one server to another without any special consideration, the reliance on a manifest requires us to solve a new problem in the case where a manifest is lost, or we want to recover from a snapshot.

          Jonathan Ellis added a comment -

> client supplied timestamps mean that you can't know that newer files supersede older ones

          Right, but I see that as an orthogonal concern. (CASSANDRA-2498 will provide similar benefits to reads as being able to sort by timestamp, but again, that's basically independent of compaction strategy.)

> 2) the CF data model means that data for a given key in multiple sstables may need to be merged

          Also right. I don't think changing that (or changing that you may have to merge data from multiple sstables on reads) should be a goal for us.

          I guess I wasn't clear; I'm not proposing "let's try to make Cassandra fit in the leveldb design."

Compaction is about "how do we avoid rewriting the same data over and over, while minimizing the space penalty from not doing overwrites" and the leveldb approach almost certainly does better on both of those metrics than our current approach, specifically because of the non-overlapping-sstables-within-a-level approach. That's the interesting part to me.

          Ryan King added a comment -

I only read the LevelDB stuff briefly. I think there's a lot we can learn, but there are at least two challenges:

1) client supplied timestamps mean that you can't know that newer files supersede older ones
          2) the CF data model means that data for a given key in multiple sstables may need to be merged

          Jonathan Ellis added a comment -

          Sure. But while bitcask (for instance) takes advantage of that to do things we can't, I don't see anything in here like that.

          Ryan King added a comment -

It's important to remember that LevelDB is key/value, not a column family data model, so there are concerns and constraints that apply to Cassandra which do not apply to LevelDB.

          Jonathan Ellis added a comment -

          Here is the approach used by LevelDB (http://leveldb.googlecode.com/svn/trunk/doc/impl.html)

          "The set of sorted tables are organized into a sequence of levels. The sorted table generated from a [flush] is placed in a special young level (also called level-0). When the number of young files exceeds a certain threshold (currently four), all of the young files are merged together with all of the overlapping level-1 files to produce a sequence of new level-1 files (we create a new level-1 file for every 2MB of data.)

          "Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping key ranges. Consider level number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L, and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks)

          "When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.

          "A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has grown enough to overlap more then ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).

          "Compactions for a particular level rotate through the key space. In more detail, for each level L, we remember the ending key of the last compaction at level L. The next compaction for level L will pick the first file that starts after this key (wrapping around to the beginning of the key space if there is no such file).

          "Level-0 compactions will read up to four 1MB files from level-0, and at worst all the level-1 files (10MB). I.e., we will read 14MB and write 14MB.

          "Other than the special level-0 compactions, we will pick one 2MB file from level L. In the worst case, this will overlap ~ 12 files from level L+1 (10 because level-(L+1) is ten times the size of level-L, and another two at the boundaries since the file ranges at level-L will usually not be aligned with the file ranges at level-L+1). The compaction will therefore read 26MB and write 26MB. Assuming a disk IO rate of 100MB/s (ballpark range for modern drives), the worst compaction cost will be approximately 0.5 second.

          "If we throttle the background writing to something small, say 10% of the full 100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at 10MB/s, we might build up lots of level-0 files (~50 to hold the 5*10MB). This may signficantly increase the cost of reads due to the overhead of merging more files together on every read.

          Then there is some discussion on possible solutions to this problem.
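
To make the picking rule above concrete, a hedged sketch (the range type and names are assumptions for illustration): one file is chosen from level L, and every level-(L+1) file whose key range intersects it becomes a compaction input.

import java.util.ArrayList;
import java.util.List;

class PickInputsSketch
{
    static class SSTableRange { long first, last; } // hypothetical key range of one sstable

    static List<SSTableRange> pickInputs(SSTableRange chosen, List<SSTableRange> nextLevel)
    {
        List<SSTableRange> inputs = new ArrayList<>();
        inputs.add(chosen);
        for (SSTableRange s : nextLevel)
            if (s.first <= chosen.last && chosen.first <= s.last)
                inputs.add(s); // the whole overlapping file is compacted, then discarded
        return inputs;
    }
}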

          Sylvain Lebresne added a comment -

A little harder? How can you efficiently return the next 100 rows in the sense of the OPP if you don't store the rows in that order within a node?

          David Boxenhorn added a comment - - edited

          > Partitioning by token ranges is functionally equivalent to virtual nodes, no? Which in the OPP case means you now have to deal with intra-node load balancing.

          I don't see any reason why OPP has to be used within a node just because it is used between nodes. I think RP should always be used within a node, and the number of partitions should be chosen to keep SST size optimal.

          RP makes OPP range queries a little harder, but not much: All partitions must be queried to find the next row, but since range queries are done in batches (e.g. get the next 100 rows) I don't think it will slow things down.

          T Jake Luciani added a comment -

Right, I'm not saying it blindly makes that assumption; the snitch would be passive, unchanged. There would be another method for identifying what nodes are currently compacting.

          Jonathan Ellis added a comment -

Sure, that is what the snitch already does. But inferring from "node X is slow" to "avoid compacting CF Y" is a leap it can't make, and shouldn't. That's the wrong hole for this peg.

          T Jake Luciani added a comment -

I think you could use the snitch, since the performance of the node doing compaction would degrade and the dynamic snitch would naturally prefer the other replicas.

          Jonathan Ellis added a comment -

          1) is a possibility but the implementation details would be very different. snitch has nothing to do with compaction and gossip is too high-latency.

          2) sounds like a great way to make things very very fragile

          T Jake Luciani added a comment -

This seems to be a brainstorming session on compaction, so here are the things I've been thinking.

1) Compaction Coordination: read performance impact can be mitigated if we make sure replicas of data are not doing compaction at the same time as the other replicas. The dynamic snitch would do the bulk of the work. Just encode compaction workload into gossip?

2) Compaction Sharing: the benefit of splitting SSTables by token range is that we can send the resulting SSTables to replicas of this data rather than have those nodes compact their own data. This could help one node do the work of the others.

          Show
          T Jake Luciani added a comment - This seems to be a brainstorming session on compaction so here are the things I've been thinking. 1) Compaction Coordination: read performance impact can be mitigated if we make sure replicas of data are not doing compaction at the same time as the other replicas. dynamic snitch would do bulk of the work. Just encode compaction workload into gossip? 2) Compaction Sharing: Benefit of splitting SStable by token range is we can send the resulting SStables to replicas of this data rather than have these nodes compact their own data. this could help 1 node do the work of the others.
          Hide
          Oleg Anastasyev added a comment -

Ryan: yes, the idea is exactly to split sstables by uniform token ranges. So, if the total node's data is 4GB and you split into 2 subranges, after major compaction you'll have 2 sstables, each sized around 2GB-4GB. If you split into 256 subranges, you'll have 256 sstables. It is unlikely that any one of them will be close to 4GB; otherwise your cluster is very badly unbalanced and you should rethink what you're really doing.

IMHO it is much better to have 256 subranges per node than 2. No negative effect on performance, but compactions as well as disk space management become much lighter.

          Jonathan Ellis added a comment -

> I don't think we should partition internally unless we're willing to go all the way and implement vnodes.

          partitions are required for vnodes but not vice versa.

          Ryan King added a comment -

          I don't think we should partition internally unless we're willing to go all the way and implement vnodes.

Oleg - I think a simpler strategy for ensuring locality is to split them by token order.

          Let's say you want to limit SSTable size to 2GB. When you go to compact 2 2GB SSTables, you could end up with anywhere from 2-4GB of data (ok not exactly true, but go with it for now). This means you'll likely still be producing 2 SSTables. I propose that you split those two by token range.
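
A tiny sketch of that split decision (tokens are modeled as longs purely for illustration; nothing here is from a patch): route each merged row to one of two output sstables depending on which half of the compacted token span it falls in.

class TokenSplitSketch
{
    // Pick which of two output sstables a row belongs to.
    static int outputIndex(long token, long minToken, long maxToken)
    {
        long mid = minToken + (maxToken - minToken) / 2; // midpoint of the compacted token span
        return token <= mid ? 0 : 1;                     // lower half to first output, upper half to second
    }
}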

          Oleg Anastasyev added a comment -

Yes, but having split an individual node's data into more subranges, we can make the negative impact of not-well-balanced sstables practically negligible.

          Jonathan Ellis added a comment -

          Partitioning by token ranges is functionally equivalent to virtual nodes, no? Which in the OPP case means you now have to deal with intra-node load balancing.

          Oleg Anastasyev added a comment -

I propose to limit the size of sstables not directly in megabytes, but by partitioning them by token ranges, just as the normal Cassandra ring is implemented. This way, data is partitioned across nodes of the cluster by token ranges and, additionally, on each node it is partitioned into several subranges of the node's token range.

Each token subrange is tracked separately and is stored in a separate memtable and sstable set. Rows from different subranges are never stored in the same memtable or sstable. Compactions and other operations run on sstables from exactly one subrange.

This way the complexities of row tracking, prioritization, etc. for fixed-size sstables are eliminated, I believe, and we have smaller sstables (well, not exactly at a 500MB limit each, but that is not the target, I think) and lighter compactions.

          Stu Hood added a comment -

          > But I am concerned with strategies that imply having to keep significant amounts of data over time, such as anything based
          > on row-level frequency/recency of access.
Although it probably sounds like I have bloom filter fever, we could use a bloom filter for this as well: filters based on multisets (spectral bloom filters, count-min sketches) allow storing approximate counts for values. The access counts for rows for the lifetime of a memtable could be stored in a filter attached to the memtable: at read time, the approximate access count of the row key could be checked against a threshold of accesses, and be stored in the memtable as superseding.
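
For concreteness, a minimal count-min sketch along those lines (the sizing and hash mixing are arbitrary choices for the example, not anything in Cassandra):

// Approximate per-key access counts in fixed memory; estimates can over-count
// on hash collisions but never under-count, which suits the threshold check above.
class CountMinSketch
{
    private static final int DEPTH = 4, WIDTH = 1 << 16; // arbitrary sizing for the example
    private final int[][] counts = new int[DEPTH][WIDTH];

    private static int bucket(long keyHash, int row)
    {
        long h = keyHash * (row * 0x9E3779B97F4A7C15L + 1); // cheap per-row remixing
        return (int) ((h >>> 32) & (WIDTH - 1));
    }

    void increment(long keyHash)
    {
        for (int row = 0; row < DEPTH; row++)
            counts[row][bucket(keyHash, row)]++;
    }

    int estimate(long keyHash)
    {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < DEPTH; row++)
            min = Math.min(min, counts[row][bucket(keyHash, row)]);
        return min;
    }
}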

          Ryan King added a comment -

I think if we just start with tracking which sstables are accessed together, then prioritizing those for being compacted together, we can improve our locality with minimal overhead.

          Fixed size (or variable within a range) SSTables could then also help. If we achieve the fixed size by splitting by token range when doing a compaction, we would have much more predictable read time for both skinny rows and wide rows. I'm not sure how well this would react to a rapidly changing workload, though.

          Kelvin Kakugawa added a comment -

          An alternative strategy that dovetails w/ the above proposal (fixed-size SSTs and keeping track of SST co-access stats):
          Lazily coordinate timestamps across the cluster by maintaining increasing timestamps across SSTs on each replica.

If we can enforce increasing timestamps across SSTs, then lookups will be cheaper. Instead of doing a slice (under the hood) for each lookup, we only need to read from the last SST where a key+column was written based on BFs (FPs notwithstanding). Slices will not be improved by this scheme.

          A way to implement this would be to:
          1) on a CL.ONE write, always write to a lead replica, and
          2) on a CL.QUORUM write, a replica will reject a write w/ a timestamp less than the highest timestamp in that CF's last SST (may need to be MT); upon which, the client will need to re-submit a write w/ the appropriate timestamp offset.

          A consideration that needs to be taken into account is AES and other SST streaming operations across replicas. e.g. streamed SSTs will need to be lined up by min/max timestamp of the SSTs; if an SST overlaps, then we may need to, either:
          1) re-partition/compact the overlapping SSTs, or
          2) lazily compact the overlapping SSTs and absorb the cost to lookups, in the meantime.

          The benefits of loosely coordinated timestamps are:
          1) lookups will be measurably improved,
          2) dovetails nicely w/ fixed-size SSTs, and
          3) SST co-access stats can be more coarse-grained (based on SST), instead of fine-grained row-level stats.

The above proposal would align our data model more closely with BigTable and HBase; i.e., lookups won't be penalized anymore.

          Peter Schuller added a comment -

          With respect to tracking information, I meant tracking the information necessary to make the determination as to which rows to supersede or otherwise act upon, rather than keeping track of what has been superseded (which as you point out is dealt with by bloom filters).

          In the case of simple criteria (such as seeing a single read spread across more than N sstables) no such tracking is necessary. But I am concerned with strategies that imply having to keep significant amounts of data over time, such as anything based on row-level frequency/recency of access.

In particular, if one hopes to supersede so effectively that hot rows (regardless of sstable spread) end up in separate sstables with high locality of access, won't this need such row-level information tracking?

          Show
          Peter Schuller added a comment - With respect to tracking information, I meant tracking the information necessary to make the determination as to which rows to supersede or otherwise act upon, rather than keeping track of what has been superseded (which as you point out is dealt with by bloom filters). In the case of simple criteria (such as seeing a single read spread across more than N sstables) no such tracking is necessary. But I am concerned with strategies that imply having to keep significant amounts of data over time, such as anything based on row-level frequency/recency of access. In particular, if one hopes to supersede so effectively that hot rows (regardless of sstable spread) end up in separate sstables with high locality of access, will not this need such row-level information tracking?
          Stu Hood added a comment -

          > Regarding superseding rows in sstables: what would be the criteria for picking which rows to supersede?
          A configurable threshold on access count is one metric: you could attach an access counter to rows in the memtable or rowcache. If a row reaches a threshold of accesses, then you supersede it by storing the data for the read in the memtable, and marking it superseded in any sstables.

          (This works for skinny rows, but just like with the row cache, we need a strategy for wider rows.)

          Another metric would be the minimum number of sstables to supersede: if a row is stored in 1 sstable, (obviously) don't supersede, but if it is stored in 4, supersede, etc.

          > would that require tracking some information over time at row level granularity? (I'm concerned about the overhead of such tracking.)
          I don't think so? If something was recently superseded, it will be stored in 1 sstable, and there is no need to supersede it again until it reaches your threshold.

          > If so, an observation is that false positives are allowed for the stats
          I think this is what we get from the bloom filter edits: sometimes a supersede will fail for some sstables, but it's not really a big deal, since the whole operation probably still pushes you below your threshold for sstables.

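          For illustration, a minimal sketch combining the two metrics above (access count and sstable spread); SupersedeTrigger and its members are hypothetical names, not an actual implementation:

          public class SupersedeTrigger
          {
              private final int accessThreshold;    // reads before a hot row is superseded
              private final int minSSTablesSpanned; // a row in 1 sstable never needs superseding

              public SupersedeTrigger(int accessThreshold, int minSSTablesSpanned)
              {
                  this.accessThreshold = accessThreshold;
                  this.minSSTablesSpanned = minSSTablesSpanned;
              }

              // Called per read with the row's access counter and the number of
              // sstables the read touched.
              public boolean shouldSupersede(int rowAccessCount, int sstablesSpanned)
              {
                  return rowAccessCount >= accessThreshold
                      && sstablesSpanned >= minSSTablesSpanned;
              }
          }
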
          Peter Schuller added a comment -

          Regarding superseding rows in sstables: what would be the criteria for picking which rows to supersede? A simple threshold would be easy and certainly addresses extreme cases of rows being spread. But if one also expects to take into account how often the rows are read, that would imply recency or frequency tracking?

          A simple threshold on sstable count would certainly help avoid the extreme cases of reads across many sstables.

          But in terms of trying to keep frequently accessed data together with high locality (as briefly alluded to in CASSANDRA-1625), would that require tracking some information over time at row level granularity? (I'm concerned about the overhead of such tracking.)

          If so, an observation is that false positives are allowed for the stats. I.e., recency/frequency could be associated with something like a 64-bit hash of the key rather than the key itself, meaning that some optimizations become possible (e.g. packing hash-of-key:counter pairs into large byte arrays).

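          A sketch of that packed hash:counter layout (hypothetical code; collisions between key hashes are simply tolerated as false positives):

          import java.util.Arrays;

          public class HashedAccessStats
          {
              private final long[] hashes; // sorted 64-bit key hashes
              private final int[] counts;  // access counter per slot
              private int size;

              public HashedAccessStats(int capacity)
              {
                  hashes = new long[capacity];
                  counts = new int[capacity];
              }

              public void record(long keyHash)
              {
                  int i = Arrays.binarySearch(hashes, 0, size, keyHash);
                  if (i >= 0)
                  {
                      counts[i]++; // existing entry, or a colliding key -- acceptable
                      return;
                  }
                  if (size == hashes.length)
                      return; // full; a real version would evict or decay entries
                  int pos = -i - 1;
                  System.arraycopy(hashes, pos, hashes, pos + 1, size - pos);
                  System.arraycopy(counts, pos, counts, pos + 1, size - pos);
                  hashes[pos] = keyHash;
                  counts[pos] = 1;
                  size++;
              }
          }
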
          Stu Hood added a comment - - edited

          > I guess Cassandra would only need a fixed count of exactly 2, making it a non-issue.
          As you said, we would need a counting filter with 2 bits per bucket: if both bits are set, the bucket has collided.

          If any of the buckets for a key have collided, you can't perform the delete, but that isn't the end of the world here.

          EDIT: Actually, if you couldn't perform the delete, you might end up superseding a given sstable multiple times, so this is something we'd want to avoid.

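          A sketch of such a 2-bit-per-bucket filter (hypothetical; our bloom filters have no such support today). A bucket value of 1 means exactly one insertion touched it; 2 means it has collided and can never be safely decremented:

          public class TwoBitCountingFilter
          {
              // One value per bucket, saturating at 2; a space-efficient version
              // would pack four 2-bit buckets into each byte.
              private final byte[] buckets;

              public TwoBitCountingFilter(int numBuckets)
              {
                  buckets = new byte[numBuckets];
              }

              public void add(int[] bucketIndexes)
              {
                  for (int i : bucketIndexes)
                      if (buckets[i] < 2)
                          buckets[i]++;
              }

              // Returns false (changing nothing) if any bucket is empty or has
              // collided; in that case the supersede must skip editing this filter.
              public boolean tryDelete(int[] bucketIndexes)
              {
                  for (int i : bucketIndexes)
                      if (buckets[i] != 1)
                          return false;
                  for (int i : bucketIndexes)
                      buckets[i] = 0;
                  return true;
              }
          }
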
          Peter Schuller added a comment -

          Although now that I think about it some more, I guess Cassandra would only need a fixed count of exactly 2, making it a non-issue.

          Peter Schuller added a comment -

          One thing to consider when deferring or omitting compaction of sstables is that, similarly to a cache, it will tend to optimize performance in the common case but is susceptible to sudden degradation in response to a suddenly changing workload. That might be something to keep within reasonable limits, at least in out-of-the-box configurations.

          On Stu's (3): what bloom filter would you have in mind that supports deletions? Is there a good paper to read? Is there anything better than what is described at http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters (an algorithm that consumes roughly half as much space as a counting bf, but still scales linearly with the max count)?

          Stu Hood added a comment -

          One way to provide locality of reference for sstables would be to persist summaries of individual rows which 'supersede' the content in sstables written before them. For example, if you have five sstables containing key 'A', you would create a new sstable #6 containing all content for 'A', and marked as superseding for 'A'. Then, since you have a full copy of data for 'A', you no longer need to read from the other sstables.

          But how would a reader know which sstables to check for a particular key? We have sstable generation numbers, but they are currently only used as unique ids. Three approaches:

          1. The 'superseding' mark for a particular key could indicate which generations it superseded, so we could prevent reads to superseded sstables
          2. If we refactored the system to read memtables/sstables in generation order, we could stop looking for content for A when we reached an sstable that superseded older sstables (see the sketch after this list)
          3. When superseding a value, we could delete it from the bloom filters for the superseded sstables. While hella cool, this solution would be a big change:
            • If you reached a threshold of emptiness in the bloom filter, you could perform single-sstable-compactions filtered by values that still match
            • Requires a bloom filter which supports deletes (ours don't yet)
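
          As a concrete illustration of approach 2, a minimal sketch of the read path; SSTable.getRow, SSTable.supersedes and Row.merge here are hypothetical stand-ins, not the real API:

          // Read newest-to-oldest and stop at the first sstable that is marked
          // as superseding this key; anything older holds only stale data.
          static Row readRow(List<SSTable> sstablesNewestFirst, DecoratedKey key)
          {
              List<Row> fragments = new ArrayList<Row>();
              for (SSTable sstable : sstablesNewestFirst) // sorted by generation, descending
              {
                  Row fragment = sstable.getRow(key); // bloom-filter guarded lookup
                  if (fragment != null)
                  {
                      fragments.add(fragment);
                      if (sstable.supersedes(key))
                          break;
                  }
              }
              return Row.merge(fragments); // newer fragments win on conflict
          }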

            People

            • Assignee: Benjamin Coverston
            • Reporter: Chris Goffinet
            • Reviewer: Jonathan Ellis
            • Votes: 12
            • Watchers: 30
