Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 0.8 beta 1
    • Component/s: Core
    • Labels:

      Description

      This ticket overlaps with CASSANDRA-1876 to a degree, but the approaches and reasoning are different enough to open a separate issue.

      The problem with compactions currently is that they compact the set of sstables that existed the moment the compaction started. This means that for longer running compactions (even when running as fast as possible on the hardware), a very large number of new sstables might be created in the meantime. We have observed this proliferation of sstables killing performance during major/high-bucketed compactions.

      One approach would be to pause compactions in upper buckets (containing larger files) when compactions in lower buckets become possible. While this would likely solve the problem with read performance, it does not actually help us perform compaction any faster, which is a reasonable requirement for other situations.

      Instead, we need to be able to perform any compactions that are currently required in parallel, independent of what bucket they might be in.
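The shape of the change described above, grouping sstables into size buckets and submitting every eligible bucket as its own task, might be sketched roughly as follows. This is an illustrative simplification, not Cassandra's actual API; all names here are hypothetical.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch (not Cassandra's real code): group sstables into
// size buckets and submit each bucket that meets the compaction threshold
// as an independent task, so a long-running compaction in a large bucket
// no longer blocks compactions in the small ones.
class ParallelCompactionSketch {
    // Bucket files by order of magnitude of their size in bytes.
    static Map<Long, List<Long>> bucketBySize(List<Long> sstableSizes) {
        Map<Long, List<Long>> buckets = new TreeMap<>();
        for (long size : sstableSizes) {
            long key = (long) Math.floor(Math.log10(Math.max(size, 1)));
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(size);
        }
        return buckets;
    }

    // Submit one compaction task per eligible bucket; returns how many ran.
    static int compactEligible(Map<Long, List<Long>> buckets, int minThreshold,
                               ExecutorService executor) {
        List<Future<?>> tasks = new ArrayList<>();
        for (List<Long> bucket : buckets.values())
            if (bucket.size() >= minThreshold)
                tasks.add(executor.submit(() -> { /* compact this bucket */ }));
        for (Future<?> f : tasks) {
            try { f.get(); } catch (Exception e) { throw new RuntimeException(e); }
        }
        return tasks.size();
    }
}
```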

        Issue Links

          Activity

          Stu Hood added a comment -

          Thanks a lot for the great review!

          > We could make the flush stage multithreaded
          As a side note, flushing is already multithreaded: see the memtable_flush_writers setting.

          Sylvain Lebresne added a comment -

          Forgot the magical +1 (better late than never)

          Hudson added a comment -

          Integrated in Cassandra #848 (See https://hudson.apache.org/hudson/job/Cassandra/848/)
          Multithreaded compactions
          patch by stuhood; reviewed by slebresne for CASSANDRA-2191

          Sylvain Lebresne added a comment -

          For the record:

          Inlined stopTheWorld in 0005. Yes, I agree that the name sucked, but whether or not it is possible for a lock acquisition to fail on a server that is not already screwed, and whether an abstraction is in order here is still up for debate

          I do like the inlined version much more. I did not pretend that the previous version wasn't working. It was just hard to check that the unmarking was happening correctly, and even though I agree lock acquisition is unlikely to fail, it would have been easy for someone else to add lines inside stopTheWorld at the wrong place that could fail. And the name sucked.

          Added an AtomicBoolean to AutoSavingCache in 0006. I reeeally think this should go to the flush stage, since the tasks have almost identical lifetimes, and we don't really need progress for either of them

          I just don't want cache saving to block flushes for too long. So I'm not saying it should never go to the flush stage, but I'm uncomfortable putting it there without some proper testing of its impact. We could make the flush stage multithreaded (with throttling); then I would have no problem with moving cache saving there (but we would still have to make sure only one save happens at a time).
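The "only one save happens at a time" guard under discussion is essentially a compare-and-set gate on a shared AtomicBoolean. A minimal sketch, with hypothetical class and method names:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the discussed guard: a task atomically flips the flag from
// false to true before saving; a concurrent task that loses the race
// simply returns instead of starting a second save.
class CacheSaveGuard {
    private static final AtomicBoolean saveInProgress = new AtomicBoolean(false);

    static boolean trySave(Runnable writeCache) {
        if (!saveInProgress.compareAndSet(false, true))
            return false; // another save is already running; skip this one
        try {
            writeCache.run();
            return true;
        } finally {
            saveInProgress.set(false); // always release, even on failure
        }
    }
}
```

A save attempted while one is running is simply discarded, which matches the behaviour proposed later in this thread.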

          Sylvain Lebresne added a comment -

          Committed as r1090978 (with 2 minor modifications: slightly updated the message in cassandra.conf (the word 'imperative' sounded too strong, not that I contest the utility of the patch or anything) and avoided an NPE in CompactionController.get{Keyspace|ColumnFamily}).

          Great work, thanks (and thanks for putting up with my pickiness).

          Stu Hood added a comment -
          • Inlined stopTheWorld in 0005. Yes, I agree that the name sucked, but whether or not it is possible for a lock acquisition to fail on a server that is not already screwed, and whether an abstraction is in order here is still up for debate
          • Removed the 'forceMajor' parameter: will open a ticket post-commit to allow for guaranteeing that a manually triggered compaction is major
          • Moved ksname/cfname into getters. I didn't do this initially because the CFS is sometimes null, but I guess you'd get the NPE in either case
          • Added an AtomicBoolean to AutoSavingCache in 0006. I reeeally think this should go to the flush stage, since the tasks have almost identical lifetimes, and we don't really need progress for either of them
          • Wrapped the IdentityHashMap into an IdentityHashSet
          • Returned printCompactionStats to its former glory
          • Removed OperationType from SSTableWriter.Builder's task type

          Thanks! CASSANDRA-2156 has been rebased as well.

          Sylvain Lebresne added a comment -

          Some comments on the new patches:

          • I (really) don't like the stopTheWorld function. I don't like the name and argument names, and I don't find the 'abstraction' it provides clear. More importantly, it makes the 'unmarking' of sstables very fragile, since the markCompacting call is not in the 'try ... finally' block where the unmark call is. Let's just inline this stopTheWorld function instead (it won't really be more code).
          • The forcing of major compaction in doCompaction is buggy, because there is no guarantee we haven't skipped some sstable (because of the skip option) or that some sstable has been removed from the set for lack of disk space. Forcing a major compaction in those cases is wrong. I know that when you submit a major compaction you could end up doing a minor one just because an sstable got flushed between the time you grabbed the set of sstables and the time you checked that this set is still complete. But that is not a new problem (nor a big one, unlike wrongly removing tombstones), so let's open a separate ticket if we want to fix it.
          • The ksname and cfname fields in CompactionController should be replaced by getters (of cfs.metadata.{ksname|cfname}), if only to make sure we don't make the renaming of ks/cf more problematic than it already is (it's probably not a problem here, but avoiding the duplication of fields would be cool anyway).

          • As far as I can tell we still have a problem with cache saving, in that two cache saves can happen at the same time (which would be bad). To answer the proposition in a previous comment, I am not a fan of using the flush writer for this, as it has more important things to do (and we would lose the reporting through JMX). A dedicated executor could solve the problem but is a bit overkill. Maybe a static AtomicBoolean that cache-writing tasks atomically compare to false and set to true at the beginning. If that succeeds, the task does the cache writing and then sets the boolean back to false; otherwise the task just returns directly. This means a cache save is discarded if one is already running, but that's probably fine (it's actually probably what you want).
          • I would still prefer an IdentityHashSet rather than a HashMap whose value is unused in CompactionExecutor, for the sake of code clarity.
          • In nodetool printCompactionStats, we should keep the println with plain English as it was before instead of using the much denser (and much less user friendly) CompactionInfo.toString().
          • In SSTableWriter.Builder, when creating the compactionInfo, putting the OperationType as it is done will look ugly and I think it will be more confusing than anything else to the user (the previous message was probably quite enough).
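The inlining asked for in the first point amounts to keeping the mark and its matching unmark in one visible try/finally, so the unmark provably runs on every path. A rough sketch with illustrative names (not the actual Cassandra code):

```java
import java.util.*;

// Illustrative: mark sstables as compacting, and guarantee the unmark
// runs by entering the try block immediately after a successful mark.
class MarkUnmarkSketch {
    private final Set<String> compacting = new HashSet<>();

    synchronized List<String> markCompacting(Collection<String> candidates) {
        for (String s : candidates)
            if (compacting.contains(s)) return null; // already being compacted
        compacting.addAll(candidates);
        return new ArrayList<>(candidates);
    }

    synchronized void unmarkCompacting(Collection<String> marked) {
        compacting.removeAll(marked);
    }

    boolean compact(Collection<String> candidates) {
        List<String> marked = markCompacting(candidates);
        if (marked == null) return false;
        try {
            // ... perform the compaction ...
            return true;
        } finally {
            unmarkCompacting(marked); // runs even if compaction throws
        }
    }

    synchronized boolean isCompacting(String s) { return compacting.contains(s); }
}
```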

          FYI, my Monday morning happens to be quite a few hours before the Monday morning 'freeze'. So if there are fixes for the things above by my Monday morning, I'll look at them in priority.

          Stu Hood added a comment - - edited

          Rebased.

          EDIT: FYI: the freeze is on Monday morning, so that might be too late.

          Sylvain Lebresne added a comment -

          I have a high number of flight hours this weekend. I'll try to use that time to look at this; I will review by Monday morning in the worst case.

          Stu Hood added a comment -

          This is ready for review... we'd reeeally like to get it in before the freeze.

          Stu Hood added a comment -
          • Implemented "acquire the write lock long enough to schedule" for major, cleanup and scrub: it's alone in patch 0005 for clarity
          • Removed the JMX methods I had deprecated before, and added a method that returns serialized objects for more programmatic access. The serialized object is concrete CompactionInfo, which replaces (I|A)CompactionInfo
          Stu Hood added a comment -
          • Added a config option to disable multithreaded compaction
          • Fixed TODOs
          • Renamed clearUnsafe
          • Moved sstablescanner changes to CASSANDRA-2431
          • Removed dead code in CompactionManager

          > I think that the cache writing tasks were fed to the compaction executor
          Should these go to the flush writer stage instead?

          I'll look into the rest of the comments and post an updated patch tomorrow.

          Sylvain Lebresne added a comment -

          Here are a number of remarks on this. I wrote them on a plane, so I did not have access to what was previously said above, and I'm too lazy to edit them now (in any case there is little overlap, and nothing is outdated I think).

          Comments:

          • An unfortunate consequence of this is that the submission of a compaction can now fail, in particular cleanup or major ones (more precisely, cleanup compactions will leave some sstables unclean, and major ones won't be major and thus won't clean all tombstones). It will be a pain for users: not only will they have to check the log after submitting one of those compactions to see what is left to do, it is also inefficient. For cleanups, you'll have to submit a new one which will redo a full cleanup compaction (and those are not as efficient as they could be); for major compaction, it's even worse. Anyway, grabbing the write lock instead of the read lock for those compactions should be good enough.
          • The different compactions weren't initially written to be run in parallel. We need a way to deactivate this if something goes wrong. It could either be a flag that makes everyone acquire the write lock, or an option to make the compaction executor single-threaded.
          • Related to the point above, I think the cache-writing tasks were fed to the compaction executor to make sure we never do two cache writes at the same time. We need to restore that property somehow.
          • There are a number of TODOs in the code. I am not a fan of leaving TODOs unless there is a really good reason to; there are JIRA tickets for that. In particular, I do not agree that flusherLock and compactionLock should be replaced by a list of sstables (at least it is not as simple as stated).
          • Let's not use clearUnsafe() to initialize DataTracker, given its name (but I'm fine with renaming it to, say, init() and having CFStore.clearUnsafe() call init()).
          • The patch about closing the sstableScanners should be another ticket, for traceability's sake.

          Some more minor remarks:

          • On the JMX reporting: I think it is OK to remove the methods instead of deprecating them (it will be in a major release, and there is no indication to the user that they are deprecated anyway). Also, since looking at a list of strings in jconsole is a bit of a pain, it would be nice to at least expose the active count, and maybe a sum over all running compactions of bytesCompacted/toCompact (it would be less ugly to expose an MBean for each CompactionInfo, but I'm not sure how easy/efficient that would be).
          • About ACompactionInfo: we use ICompactionInfo for interfaces but AbstractCompactionInfo for abstract classes. Let's keep it that way for the sake of code-base coherence.
          • The compacting set field in CompactionManager is dead code. So is MIN_COMPACTION_THRESHOLD in CompactionsSet.
          • I don't find the docstring for markCompacting very clear (it kind of suggests markCompacting should be in a finally block, and it's unclear that this is the method that does the marking).
          • I would prefer adding forwarding methods in CFStore for mark/unmark since that's what we've done so far (instead of exporting DataTracker).
          • In validationCompaction, the try would ideally be the next thing after the markCompacting().
          • I'd prefer moving toString(CompactionInfo) into ICompactionInfo (or AbstractCompactionInfo). Also, since this is for JMX reporting, adding the object hashcode will confuse users more than anything else.
          • In CompactionExecutor, I suppose the HashMap with Object values is just a way to deal with the absence of an IdentityHashSet. If so, for clarity I would prefer using Collections.newSetFromMap(new IdentityHashMap()). Or really, just a plain Set should do, as it would make no sense to redefine CompactionInfo equality to be anything other than reference equality.
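The suggested idiom gives reference-equality set semantics without needing a dedicated IdentityHashSet class. A small demonstration (the class and method here are only for illustration):

```java
import java.util.*;

// Collections.newSetFromMap over an IdentityHashMap yields a set that
// compares elements by reference (==) rather than by equals().
class IdentitySetDemo {
    static int countDistinct(boolean byIdentity, String a, String b) {
        Set<String> set = byIdentity
                ? Collections.newSetFromMap(new IdentityHashMap<String, Boolean>())
                : new HashSet<String>();
        set.add(a);
        set.add(b);
        return set.size();
    }
}
```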
          amorton added a comment -

          4) Sounds reasonable if throttling is on.

          6) I'm not familiar with the bloom filter optimization you mentioned. However, it seems that more than anything else the major flag in doCompaction() indicates whether the compaction is running on all sstables, regardless of how the process was triggered; i.e. the first ever minor compaction would also be marked as major by this logic. PrecompactedRow and LazilyCompactedRow will purge rows if the major flag is set or the key is only present in the sstables under compaction. I'm not sure why the extra check is there for minor compactions, but it looks like losing the fact that a major/manual compaction was started could change the purge behaviour.

          I'm also trying to understand whether the isKeyInRemainingSSTables() in the AbstractCompactedRow subclasses could be affected by multithreading. E.g.: a CF with two buckets and a high min compaction threshold (so longer compactions), two concurrent minor compactions (one in each bucket), and row A in both buckets. If either thread processes row A before the other finishes, that would stop the thread from purging the row; is there a race condition that stops both threads from purging it?

          9) We do not use the value in the compactions map; could we set it to the current system time when beginCompaction() is called and use that to sort the list? Not a biggie.

          Stu Hood added a comment -

          Regarding 6.: there is an issue related to behaviour here: thank you for pointing it out. As implemented in this patch, triggering a major compaction will start it immediately with whatever sstables aren't already active in compaction: this means that triggering a major compaction while another compaction is running will not result in a major compaction. In the past, this wouldn't have been kosher, because major compactions were required to clean up tombstones.

          But now that we have the bloomfilter checking optimization for compaction, the new behaviour is probably sufficient: when there ''aren't'' any other compactions active, the "major" flag is an optimization that makes the bloomfilter checks unnecessary; when there ''are'' other compactions active, the bloomfilter check works as usual.

          Thoughts?

          Stu Hood added a comment - - edited

          1. Added if (max < min || max < 1) return null; check
          2. Switched to LinkedHashSet to preserve order
          3. Fixed, thanks
          4. The idea is that the throttling from CASSANDRA-2156 is a sufficient preventative measure for this, but it will be important to enable it by default, hopefully using the metrics from CASSANDRA-2171. I'd prefer not to have two tunables, but it's worth discussing.
          5. CompactionManagerMBean exposes getPendingTasks and getCompletedTasks with entirely different meanings, so I'm not sure we'd gain anything by this
          6. (see next comment)
          7. Yes, it probably would, but I think that is an issue for a separate ticket. Usually compacting the smallest bucket first (since the files are likely to be hot in cache) is the biggest win (which we do): it will be very rare for higher buckets to be more important
          8. This would probably be a good idea, for example, if you have more than max sstables in the minimum bucket and not enough active threads to parallelize the bucket. Opened CASSANDRA-2407
          9. I can't think of an easy way to do this, but if you can, I'm willing
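The fix in item 2 relies on LinkedHashSet iterating in insertion order, so a bucket that was sorted oldest-first stays in that order where a plain HashSet would scramble it. A quick illustration (names are illustrative only):

```java
import java.util.*;

// LinkedHashSet preserves insertion order, so passing an oldest-first
// sorted bucket through it keeps the ordering that a plain HashSet
// does not guarantee.
class OrderDemo {
    static List<String> throughLinkedHashSet(List<String> sortedBucket) {
        return new ArrayList<>(new LinkedHashSet<>(sortedBucket));
    }
}
```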

          amorton added a comment -

          I have a bunch of questions mostly because I'm trying to understand the reasons for doing things.

          1. If max is 0, SSTableTracker.markCompacting() will return an empty list rather than null.
          2. CompactionManager.submitMinorIfNeeded() sorts the SSTables in the bucket to compact the older ones first. When the list is passed to SSTableTracker.markCompacting(), the order is lost.
          3. In CompactionManager.submitIndexBuild() and submitSSTableBuild(), should the calls to the executor be in an inner try block to ensure the lock is always released?
          4. If the size of the thread pool for CompactionManager.CompactionExecutor() is not configurable, is there a risk of using too many threads and saturating the I/O with compaction? Could some people want fewer than one thread per core?
          5. For my understanding: what about the CompactionExecutor using the JMXEnabledThreadPoolExecutor so its stats come back in tpstats?
          6. There is a comment in CompactionManager.doCompaction() about relying on a single compaction thread when determining whether it is a major compaction.
          7. The order in which the buckets are processed appears to be undefined. Would it make sense to order them by number of files or average size, so there is a more predictable outcome when multiple threads may be working through a similar set of files?
          8. For my understanding: have you considered adding a flag so that a minor compaction stops processing buckets if additional threads have started? This might make compaction less aggressive, as it would fall back more quickly to a single thread until more were needed again.
          9. The order of the list returned from CompactionExecutor.getCompactions() is undefined. Could entries be returned in the order they were added to the executor, to make the data returned from CompactionExecutor.getColumnFamilyInProgress() more reliable?
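Questions 1 and 2 both concern the markCompacting() contract. A minimal sketch of what that contract could look like is below; this is not the actual SSTableTracker code (sstables are stood in for by strings, and the guard and LinkedHashSet mirror the fixes discussed in this thread):

```java
import java.util.*;

// Sketch of the markCompacting() contract: return null when the bucket
// cannot be scheduled (non-positive max, or fewer than `min` free
// sstables), and preserve the caller's oldest-first ordering by using
// a LinkedHashSet.
public class MarkCompactingSketch {
    private final Set<String> compacting = new HashSet<>();

    public synchronized Set<String> markCompacting(Collection<String> candidates, int min, int max) {
        if (max < min || max < 1)
            return null; // nothing sensible to schedule
        Set<String> marked = new LinkedHashSet<>(); // keeps insertion order
        for (String sstable : candidates) {
            if (marked.size() >= max)
                break;
            if (!compacting.contains(sstable))
                marked.add(sstable);
        }
        if (marked.size() < min)
            return null; // not enough free sstables in this bucket
        compacting.addAll(marked);
        return marked;
    }

    public synchronized void unmarkCompacting(Collection<String> done) {
        compacting.removeAll(done); // called from a finally block
    }

    public static void main(String[] args) {
        MarkCompactingSketch tracker = new MarkCompactingSketch();
        System.out.println(tracker.markCompacting(List.of("a", "b"), 2, 0));            // null: max < 1
        System.out.println(tracker.markCompacting(List.of("old", "mid", "new"), 2, 2)); // [old, mid]
        System.out.println(tracker.markCompacting(List.of("old", "mid", "new"), 2, 32)); // null: only "new" is free
        tracker.unmarkCompacting(List.of("old", "mid"));
        System.out.println(tracker.markCompacting(List.of("old", "mid", "new"), 2, 32)); // [old, mid, new]
    }
}
```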
          Stu Hood added a comment -

          Rebased for trunk.

          amorton added a comment -

          Stu, as discussed on IRC, I tried to take a look at this, but it failed to apply against the current trunk.

          aarons-MBP-2011:cassandra aaron$ git am patch/2191/0001-Add-a-compacting-set-to-sstabletracker.txt 
          Applying: Add a `compacting` set to sstabletracker
          error: patch failed: src/java/org/apache/cassandra/io/sstable/SSTableTracker.java:53
          error: src/java/org/apache/cassandra/io/sstable/SSTableTracker.java: patch does not apply
          Patch failed at 0001 Add a `compacting` set to sstabletracker
          
          Stu Hood added a comment -

          Adds a third patch to clean up JMX: deprecates the existing single compaction methods, and adds a list-of-strings output for running compactions.

          Ready for review.

          Jonathan Ellis added a comment -

          I would prefer simply trying to solve reporting for CM/CE; we only have information on those tasks because we have ICompactionInfo to tell us about it. I don't think we want to turn this into "add some kind of progress reporting for generic Runnables."

          Stu Hood added a comment -

          I just noticed this makes monitoring inaccurate: CompactionExecutor is going to need a bit of an overhaul to report multiple compactions. Alternatively, as stated in the FIXME in CompactionExecutor, it could become a real stage: perhaps we could find a way to generically report on running tasks in stages?
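One way the per-thread reporting could work is sketched below. This is a hypothetical illustration, not the actual CompactionExecutor: beforeExecute/afterExecute maintain a map of what each worker thread is currently running, which a JMX getter could expose as a list of strings.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: a ThreadPoolExecutor that tracks the task each worker thread
// is currently executing, so several in-flight compactions can all be
// reported instead of just one.
public class ReportingExecutorSketch extends ThreadPoolExecutor {
    private final Map<Thread, String> inProgress = new ConcurrentHashMap<>();

    public ReportingExecutorSketch(int threads) {
        super(threads, threads, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
    }

    @Override
    protected void beforeExecute(Thread t, Runnable r) {
        inProgress.put(t, r.toString()); // record the task this worker picked up
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        inProgress.remove(Thread.currentThread()); // always runs, even on failure
    }

    public List<String> getCompactions() {
        return new ArrayList<>(inProgress.values()); // snapshot for a JMX getter
    }

    public static void main(String[] args) throws Exception {
        ReportingExecutorSketch exec = new ReportingExecutorSketch(2);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        exec.execute(new Runnable() {
            public void run() {
                started.countDown();
                try { release.await(); } catch (InterruptedException e) { }
            }
            @Override public String toString() { return "Standard1 minor compaction"; }
        });
        started.await();
        System.out.println(exec.getCompactions()); // [Standard1 minor compaction]
        release.countDown();
        exec.shutdown();
        exec.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(exec.getCompactions()); // []
    }
}
```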

          Stu Hood added a comment -

          Patch to add a "compacting" set to the SSTableTracker, which is atomically modified to schedule compactions. SSTables are removed from the compacting set in a finally block.

          Also, converts the "compactionLock", which is only used by migrations (to completely stop compactions), to a read-write lock. Running compactions acquire as readers, migrations acquire as writer.
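The lock conversion described above can be sketched with java.util.concurrent's ReentrantReadWriteLock; the class and method names here are illustrative, not the actual CompactionManager fields:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: compactions share the lock as readers, so several can run at
// once; a migration takes the write lock, excluding all compactions.
public class CompactionLockSketch {
    private final ReentrantReadWriteLock compactionLock = new ReentrantReadWriteLock();

    public void runCompaction(Runnable task) {
        compactionLock.readLock().lock(); // many compactions may hold this at once
        try {
            task.run();
        } finally {
            compactionLock.readLock().unlock();
        }
    }

    public void runMigration(Runnable task) {
        compactionLock.writeLock().lock(); // blocks until all compactions finish
        try {
            task.run();
        } finally {
            compactionLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        CompactionLockSketch sketch = new CompactionLockSketch();
        StringBuilder log = new StringBuilder();
        sketch.runCompaction(() -> log.append("compact;"));
        sketch.runMigration(() -> log.append("migrate;"));
        System.out.println(log); // compact;migrate;
    }
}
```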

          Implications: up to #num-procs compactions will run at once, possibly within the same bucket, but likely in different buckets.

          This patch goes hand in hand with CASSANDRA-2156, which ensures that despite our multithreading, we don't trample other operations on the system.


            People

            • Assignee: Stu Hood
            • Reporter: Stu Hood
            • Reviewer: Sylvain Lebresne
            • Votes: 0
            • Watchers: 6