Cassandra
  1. Cassandra
  2. CASSANDRA-2521

Move away from Phantom References for Compaction/Memtable

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels:
      None

      Description

      http://wiki.apache.org/cassandra/MemtableSSTable

      Let's move to using reference counting instead of relying on GC to be called in StorageService.

      1. 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch
        47 kB
        Sylvain Lebresne
      2. 0001-Use-reference-counting-to-decide-when-a-sstable-can-v2.patch
        46 kB
        Sylvain Lebresne
      3. 0002-Force-unmapping-files-before-deletion-v2.patch
        7 kB
        Sylvain Lebresne
      4. 2521-v3.txt
        50 kB
        Jonathan Ellis
      5. 2521-v4.txt
        56 kB
        Sylvain Lebresne
      6. 2521-v5.patch
        60 kB
        Sylvain Lebresne

        Issue Links

          Activity

          Hide
          Jonathan Ellis added a comment -

          I imagine we could have a read-in-progress AtomicInt on the SSTR objects like we tried for Memtables w/ CASSANDRA-1954.

          Show
          Jonathan Ellis added a comment - I imagine we could have a read-in-progress AtomicInt on the SSTR objects like we tried for Memtables w/ CASSANDRA-1954 .
          Hide
          Jeff Kesselman added a comment -

          Relying on either finalizers or phantom references for your own clean up is not correct. These mechanisms are not triggered when an object becomes a candidate for collection, but only when the space is actually reclaimed.

          The time between these two states is up to the gc implementation and can be infinite in that the gc is never required to collect any given object unless failure to collect that object would result in an Out Of Memory condition. GCs often only partially collect, choosing the easiest objects to clean up first and going no further if no more space is required. A null gc that never collects space is actually perfectly legal under the java VM specification and has actually been used by IBM in the past for certain operational environments.

          Show
          Jeff Kesselman added a comment - Relying on either finalizers or phantom references for your own clean up is not correct. These mechanisms are not triggered when an object becomes a candidate for collection, but only when the space is actually reclaimed. The time between these two states is up to the gc implementation and can be infinite in that the gc is never required to collect any given object unless failure to collect that object would result in an Out Of Memory condition. GCs often only partially collect, choosing the easiest objects to clean up first and going no further if no more space is required. A null gc that never collects space is actually perfectly legal under the java VM specification and has actually been used by IBM in the past for certain operational environments.
          Hide
          Jeff Kesselman added a comment -

          It should also be noted that any object with either a finalizer or a phantom reference to it cannot be cleaned up in the eden by the hotspot generational collector and is forced to proceed to the next generation where gc is significantly more expensive. Whether this has any real effect on your gc times depends on the lifetime of the object in question. Creating a lot of short lived objects with finalizers or phantom references is bad. Creating middle to long lived objects with finalizers or phantom references is less of a problem.

          Show
          Jeff Kesselman added a comment - It should also be noted that any object with either a finalizer or a phantom reference to it cannot be cleaned up in the eden by the hotspot generational collector and is forced to proceed to the next generation where gc is significantly more expensive. Whether this has any real effect on your gc times depends on the lifetime of the object in question. Creating a lot of short lived objects with finalizers or phantom references is bad. Creating middle to long lived objects with finalizers or phantom references is less of a problem.
          Hide
          Jonathan Ellis added a comment -

          Sylvain has the largest backlog of any contributor, so I'm going to take this off his plate so someone with more time (Jeffrey?) can take a stab at it.

          Show
          Jonathan Ellis added a comment - Sylvain has the largest backlog of any contributor, so I'm going to take this off his plate so someone with more time (Jeffrey?) can take a stab at it.
          Hide
          Sylvain Lebresne added a comment -

          Actually I started working on this yesterday evening and I think I'm almost done. So re-assigning to myself for now

          Show
          Sylvain Lebresne added a comment - Actually I started working on this yesterday evening and I think I'm almost done. So re-assigning to myself for now
          Hide
          Terje Marthinussen added a comment -

          Fabulous!

          If you backport that to 0.8 (or if it is back-portable), I will volunteer to give it a decent doze of real life testing asap.

          Show
          Terje Marthinussen added a comment - Fabulous! If you backport that to 0.8 (or if it is back-portable), I will volunteer to give it a decent doze of real life testing asap.
          Hide
          Sylvain Lebresne added a comment -

          Attaching patch against trunk. Tests are passing and it seems to work, at least with small tests. I started a stress on a 3 node cluster with a repair and a major compaction started towards the end and compacted files did wait to be fully streamed to be removed and I didn't hit any bump (I did hit CASSANDRA-2769 a bunch of time but that's another story).

          Still, this is a fairly tricky problem so it could use other eyes. The basics are fairly simple though: each time a thread want to do something with a SSTableReader, it "acquires" a reference to that sstable and releases it when done. SSTableReader just keep a counter of acquired references. When the sstable has been marked compacted, we start looking until all acquired reference has been released. When that's the case, the file can be removed.

          Obviously the main drawback of this approach (compared to the phantomReference one) is that there is room for error. If a consumer forgot to acquire a reference (or do it in a non-thread-safe manner), the sstable can be removed. Thankfully there is not so many place in the code that needs to do this so hopefully I haven't missed any place.

          The other thing is that if a reference on a sstable is acquired, it should be released (otherwise the sstable will not be removed until next restart). I've try to ensure this using try-catch block, but it's not really possible with the way streaming works. However, if streaming fails, it's not really worst than before since the files where not cleaned due to the (failed) session staying in the global map of streaming sessions. CASSANDRA-2433 should fix that in most cases anyway.

          Last thing is http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4715154. In other words, the deletion of a file won't work until the mmapping is finalized (aka, GC, Where art thou), at least on windows. For that reason, when the deletion of file fails (after the usual number of retries, which btw may make less sense now), the deletion task is saved in a global list. If Cassandra is low on disk, it will still trigger a GC, after which it will reschedule all failed files in the hope they can now be deleted. There is also a JMX call to retry this rescheduling.

          Show
          Sylvain Lebresne added a comment - Attaching patch against trunk. Tests are passing and it seems to work, at least with small tests. I started a stress on a 3 node cluster with a repair and a major compaction started towards the end and compacted files did wait to be fully streamed to be removed and I didn't hit any bump (I did hit CASSANDRA-2769 a bunch of time but that's another story). Still, this is a fairly tricky problem so it could use other eyes. The basics are fairly simple though: each time a thread want to do something with a SSTableReader, it "acquires" a reference to that sstable and releases it when done. SSTableReader just keep a counter of acquired references. When the sstable has been marked compacted, we start looking until all acquired reference has been released. When that's the case, the file can be removed. Obviously the main drawback of this approach (compared to the phantomReference one) is that there is room for error. If a consumer forgot to acquire a reference (or do it in a non-thread-safe manner), the sstable can be removed. Thankfully there is not so many place in the code that needs to do this so hopefully I haven't missed any place. The other thing is that if a reference on a sstable is acquired, it should be released (otherwise the sstable will not be removed until next restart). I've try to ensure this using try-catch block, but it's not really possible with the way streaming works. However, if streaming fails, it's not really worst than before since the files where not cleaned due to the (failed) session staying in the global map of streaming sessions. CASSANDRA-2433 should fix that in most cases anyway. Last thing is http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4715154 . In other words, the deletion of a file won't work until the mmapping is finalized (aka, GC, Where art thou), at least on windows. For that reason, when the deletion of file fails (after the usual number of retries, which btw may make less sense now), the deletion task is saved in a global list. If Cassandra is low on disk, it will still trigger a GC, after which it will reschedule all failed files in the hope they can now be deleted. There is also a JMX call to retry this rescheduling.
          Hide
          Jonathan Ellis added a comment -

          If you backport that to 0.8 (or if it is back-portable), I will volunteer to give it a decent doze of real life testing asap.

          I am not comfortable changing this in a stable release series, but trunk is a very small collection of changes on top of 0.8 right now (CASSANDRA-2062 is the biggest). You'd be almost as stable using this patch in trunk, as you would be backporting it.

          Show
          Jonathan Ellis added a comment - If you backport that to 0.8 (or if it is back-portable), I will volunteer to give it a decent doze of real life testing asap. I am not comfortable changing this in a stable release series, but trunk is a very small collection of changes on top of 0.8 right now ( CASSANDRA-2062 is the biggest). You'd be almost as stable using this patch in trunk, as you would be backporting it.
          Hide
          Jonathan Ellis added a comment -

          the deletion of a file won't work until the mmapping is finalized (aka, GC, Where art thou), at least on windows

          Ugh, totally forgot about that. Probably one of the reasons we went with "might as well use a phantom reference" in the first place.

          Linux does allow the delete but I suspect the space doesn't actually get freed up until the munmap still. So not really an improvement there either.

          Is it possible to do mmap/munmap via JNA for libc-based systems? I think I'd rather fall back to non-mmap'd i/o on windows, than continue to hack around the GC like this.

          Show
          Jonathan Ellis added a comment - the deletion of a file won't work until the mmapping is finalized (aka, GC, Where art thou), at least on windows Ugh, totally forgot about that. Probably one of the reasons we went with "might as well use a phantom reference" in the first place. Linux does allow the delete but I suspect the space doesn't actually get freed up until the munmap still. So not really an improvement there either. Is it possible to do mmap/munmap via JNA for libc-based systems? I think I'd rather fall back to non-mmap'd i/o on windows, than continue to hack around the GC like this.
          Hide
          T Jake Luciani added a comment -

          We could use this trick:

            try {
                  // Manually free the old buffer using undocumented unsafe APIs.
                  // If this fails, we'll drop the reference and hope GC finds it
                  // eventually.
                  Object cleaner = buf.getClass().getMethod("cleaner").invoke(buf);
                  cleaner.getClass().getMethod("clean").invoke(cleaner);
                } catch (Exception e) {
                  // Perhaps a non-sun-derived JVM - contributions welcome
                  LOG.warn("Couldn't realloc bytebuffer", e);
                }
          
          
          Show
          T Jake Luciani added a comment - We could use this trick: try { // Manually free the old buffer using undocumented unsafe APIs. // If this fails, we'll drop the reference and hope GC finds it // eventually. Object cleaner = buf.getClass().getMethod("cleaner").invoke(buf); cleaner.getClass().getMethod("clean").invoke(cleaner); } catch (Exception e) { // Perhaps a non-sun-derived JVM - contributions welcome LOG.warn("Couldn't realloc bytebuffer", e); }
          Hide
          Ryan King added a comment -

          +1 for less "hacking around the GC"

          Show
          Ryan King added a comment - +1 for less "hacking around the GC"
          Hide
          Peter Schuller added a comment -

          @jbellis The unlink is fine (just like with files that are opened w/o mmap), but the actual deletion has to be postponed for fundamental reasons: How would the userland application be informed of an attempted access? You could segfault of course, or similar delivery of an asynchronous event, but that's not very useful to most applications.

          This closely relates to why Java doesn't allow forced munmap to begin with; having to do some kind of synchronization to allow safe access to munmapped memory in a way that doesn't violate the Java sandbox, would be very costly given that a large point of mmap is to utilize the CPU's MMU in the non-faulting case for a minimum of overhead.

          (I'm very much +1 on not using the GC for external resources like sstables (for trunk).)

          @jbellis And yes we can do our "own" mmap:ing with JNA as an alternative, but given that tjake's solution should work for Sun JVM:s even without JNA that seems preferable to me. Longer term support for non-hotspot JVM:s seems like a lesser concern. Cassandra isn't really something you want to run on an "arbitrary" JVM anyway.

          Show
          Peter Schuller added a comment - @jbellis The unlink is fine (just like with files that are opened w/o mmap), but the actual deletion has to be postponed for fundamental reasons: How would the userland application be informed of an attempted access? You could segfault of course, or similar delivery of an asynchronous event, but that's not very useful to most applications. This closely relates to why Java doesn't allow forced munmap to begin with; having to do some kind of synchronization to allow safe access to munmapped memory in a way that doesn't violate the Java sandbox, would be very costly given that a large point of mmap is to utilize the CPU's MMU in the non-faulting case for a minimum of overhead. (I'm very much +1 on not using the GC for external resources like sstables (for trunk).) @jbellis And yes we can do our "own" mmap:ing with JNA as an alternative, but given that tjake's solution should work for Sun JVM:s even without JNA that seems preferable to me. Longer term support for non-hotspot JVM:s seems like a lesser concern. Cassandra isn't really something you want to run on an "arbitrary" JVM anyway.
          Hide
          Jonathan Ellis added a comment -

          We could use this trick

          It's not obvious that this works with mapped buffers (as opposed to "normal" malloc'd direct buffers) but it looks like it does:

          DirectByteBuffer constructor, used via reflection to create MappedByteBuffer objects

              // For memory-mapped buffers -- invoked by FileChannelImpl via reflection
              protected DirectByteBuffer(int cap, long addr, Runnable unmapper) {
                  super(-1, 0, cap, cap, true);
          	address = addr;
                  viewedBuffer = null;
          	cleaner = Cleaner.create(this, unmapper);
              }
          

          Here's the unmapper that gets passed, from FCI:

              private static class Unmapper
                  implements Runnable
              {
          
                  private long address;
                  private long size;
          
                  private Unmapper(long address, long size) {
                      assert (address != 0);
                      this.address = address;
                      this.size = size;
                  }
          
                  public void run() {
                      if (address == 0)
                          return;
                      unmap0(address, size);
                      address = 0;
                  }
          
              }
          

          (And of course Cleaner.clean just calls run on its Runnable, with a second layer of only-run-this-once wrapping.)

          Show
          Jonathan Ellis added a comment - We could use this trick It's not obvious that this works with mapped buffers (as opposed to "normal" malloc'd direct buffers) but it looks like it does: DirectByteBuffer constructor, used via reflection to create MappedByteBuffer objects // For memory-mapped buffers -- invoked by FileChannelImpl via reflection protected DirectByteBuffer( int cap, long addr, Runnable unmapper) { super (-1, 0, cap, cap, true ); address = addr; viewedBuffer = null ; cleaner = Cleaner.create( this , unmapper); } Here's the unmapper that gets passed, from FCI: private static class Unmapper implements Runnable { private long address; private long size; private Unmapper( long address, long size) { assert (address != 0); this .address = address; this .size = size; } public void run() { if (address == 0) return ; unmap0(address, size); address = 0; } } (And of course Cleaner.clean just calls run on its Runnable, with a second layer of only-run-this-once wrapping.)
          Hide
          Terje Marthinussen added a comment -

          We stopped using mmap a while back. Benchmarks was highly inconclusive about any performance benefits and the operational benefits of seeing how much memory the darned thing actually uses and general improved stabilty has so far easily outweighed any teorethical lower singel digit performance benefit we could find in static benchmarks and on more realistic use we so far don't really see any performance issues.

          Personally I have no problems with a patch with that limitation.

          Show
          Terje Marthinussen added a comment - We stopped using mmap a while back. Benchmarks was highly inconclusive about any performance benefits and the operational benefits of seeing how much memory the darned thing actually uses and general improved stabilty has so far easily outweighed any teorethical lower singel digit performance benefit we could find in static benchmarks and on more realistic use we so far don't really see any performance issues. Personally I have no problems with a patch with that limitation.
          Hide
          Sylvain Lebresne added a comment -

          Attaching rebased first patch and a second patch to implement the "Cleaner" trick.

          I have confirmed on an example that, at least on linux, it does force the unmapping: the jvm crashes if you try to access the buffer after the unmapping.

          This is the biggest drawback of this approach imho. If we screw up with the reference counting and some thread does access the mapping, we won't get a nice exception, the JVM will simply crash (with the headache of having to find if it does is a bug on our side or a JVM bug). But for the quick testing I've done, it seems to work correctly.

          Show
          Sylvain Lebresne added a comment - Attaching rebased first patch and a second patch to implement the "Cleaner" trick. I have confirmed on an example that, at least on linux, it does force the unmapping: the jvm crashes if you try to access the buffer after the unmapping. This is the biggest drawback of this approach imho. If we screw up with the reference counting and some thread does access the mapping, we won't get a nice exception, the JVM will simply crash (with the headache of having to find if it does is a bug on our side or a JVM bug). But for the quick testing I've done, it seems to work correctly.
          Hide
          Jonathan Ellis added a comment -

          rebased, merged patches. updated the deletion task to not bother retrying until rescheduleFailedTasks is called – there's negligible chance that GC will happen to force an unmap otherwise. Also, added rescheduleFailedTasks to GCInspector when it is notified of a CMS.

          why does DT.removeOldSSTableSize acquire/release around markCompacted?

          Show
          Jonathan Ellis added a comment - rebased, merged patches. updated the deletion task to not bother retrying until rescheduleFailedTasks is called – there's negligible chance that GC will happen to force an unmap otherwise. Also, added rescheduleFailedTasks to GCInspector when it is notified of a CMS. why does DT.removeOldSSTableSize acquire/release around markCompacted?
          Hide
          Peter Schuller added a comment -

          For the record, if a subsequent mapping overlaps with the one that got unmapped, it's possible reads give randomly selected data from some other file, or writes corrupting some other file. So failure cases are not limited to crashes (which would be preferable in this case).

          Show
          Peter Schuller added a comment - For the record, if a subsequent mapping overlaps with the one that got unmapped, it's possible reads give randomly selected data from some other file, or writes corrupting some other file. So failure cases are not limited to crashes (which would be preferable in this case).
          Hide
          Terje Marthinussen added a comment -

          I have not tested Jonathan's updated version, but the version before seems to work good so far.

          Repair still keep data around while running, but that may be solved with CASSANDRA-2816.

          Other than that, sstables in general seems to disappear within a minute of being freed from compaction.

          Except for when running repair, I have to actively "hunt" for -Compacted files to find them now.

          Show
          Terje Marthinussen added a comment - I have not tested Jonathan's updated version, but the version before seems to work good so far. Repair still keep data around while running, but that may be solved with CASSANDRA-2816 . Other than that, sstables in general seems to disappear within a minute of being freed from compaction. Except for when running repair, I have to actively "hunt" for -Compacted files to find them now.
          Hide
          Sylvain Lebresne added a comment -

          Repair still keep data around while running, but that may be solved with CASSANDRA-2816.

          This is the expected behavior. Or more precisely, it all depends on what we're talking about. When repair is working with a sstable, we cannot and we should not delete it. Even if they have been compacted, if we're in the middle of streaming them, we should not delete it, there is nothing we can do about it. Now, I've tried to make it so that as soon as a file is not needed by repair anymore, it is deleted and thus if what you're talking about is sstable that get kept around forever, then it is a bug in the patch. To have a -Compacted file around don't mean the sstable can be deleted and it will never be.

          Show
          Sylvain Lebresne added a comment - Repair still keep data around while running, but that may be solved with CASSANDRA-2816 . This is the expected behavior. Or more precisely, it all depends on what we're talking about. When repair is working with a sstable, we cannot and we should not delete it. Even if they have been compacted, if we're in the middle of streaming them, we should not delete it, there is nothing we can do about it. Now, I've tried to make it so that as soon as a file is not needed by repair anymore, it is deleted and thus if what you're talking about is sstable that get kept around forever, then it is a bug in the patch. To have a -Compacted file around don't mean the sstable can be deleted and it will never be.
          Hide
          Terje Marthinussen added a comment -

          Yes, no problem that files stay for a while during repair, but I believe currently, files are not deleted until the entire repair is finished even if they have completed streaming, and I suppose that is not needed?

          Regardless, I believe CASSANDRA-2816 may fix this?

          Show
          Terje Marthinussen added a comment - Yes, no problem that files stay for a while during repair, but I believe currently, files are not deleted until the entire repair is finished even if they have completed streaming, and I suppose that is not needed? Regardless, I believe CASSANDRA-2816 may fix this?
          Hide
          Terje Marthinussen added a comment -

          When that is said, I have to actually verify that nothing is deleted until the repair is completed. I am pretty sure this happened with GC doing the cleanup, but I have actually not verified that with this patch.

          I just observe that I do not see anything get deleted until it finishes, but just because I don't see that with some occasional ls's does not mean it does not happen.

          Show
          Terje Marthinussen added a comment - When that is said, I have to actually verify that nothing is deleted until the repair is completed. I am pretty sure this happened with GC doing the cleanup, but I have actually not verified that with this patch. I just observe that I do not see anything get deleted until it finishes, but just because I don't see that with some occasional ls's does not mean it does not happen.
          Hide
          Terje Marthinussen added a comment -

          Hm... there may actually be something here.
          I am getting some *Compacted which does not seem to go away, not even after 7 hours after finished.

          3 Compactions 2 of them have left 1 sstable, one has left 2.

          I also happened to run 3 repairs, but one compaction starts just after one of those repairs finished (the other 2 starts and finished during the repair.

          This is running 2 as minimum compaction threshold.
          The compaction which left 2 files, left them after a compaction of 3 files.
          The 2 other *Compacted was made after compactions of 2 files

          The system is under reasonably heavy input of data and sstables are added, compacted and removed all the time. Just these which do not get removed and the oldest is 8 hours now.

          Note, I do not use mmap.

          Show
          Terje Marthinussen added a comment - Hm... there may actually be something here. I am getting some *Compacted which does not seem to go away, not even after 7 hours after finished. 3 Compactions 2 of them have left 1 sstable, one has left 2. I also happened to run 3 repairs, but one compaction starts just after one of those repairs finished (the other 2 starts and finished during the repair. This is running 2 as minimum compaction threshold. The compaction which left 2 files, left them after a compaction of 3 files. The 2 other *Compacted was made after compactions of 2 files The system is under reasonably heavy input of data and sstables are added, compacted and removed all the time. Just these which do not get removed and the oldest is 8 hours now. Note, I do not use mmap.
          Hide
          Terje Marthinussen added a comment -

          Hm, forget that. All 3 compactions happens during the repairs.

          This test is also using CASSANDRA-2433 and there are no errors reported during the repairs.

          Only one of the leftover sstables is listed in the log entries for stream context metadata

          Show
          Terje Marthinussen added a comment - Hm, forget that. All 3 compactions happens during the repairs. This test is also using CASSANDRA-2433 and there are no errors reported during the repairs. Only one of the leftover sstables is listed in the log entries for stream context metadata
          Hide
          Terje Marthinussen added a comment -

          Did another repair overnight, one minor compaction included some 20 small sstables, all of them remains as well as a few from other compactions and the files from the repairs described before.

          Definitely something fishy here.

          Show
          Terje Marthinussen added a comment - Did another repair overnight, one minor compaction included some 20 small sstables, all of them remains as well as a few from other compactions and the files from the repairs described before. Definitely something fishy here.
          Hide
          Terje Marthinussen added a comment -

          As for the last version of this patch, a quick look tonight shows access problems with markCurrentViewReferenced() which is called from outside cfs.

          Show
          Terje Marthinussen added a comment - As for the last version of this patch, a quick look tonight shows access problems with markCurrentViewReferenced() which is called from outside cfs.
          Hide
          Sylvain Lebresne added a comment -

          why does DT.removeOldSSTableSize acquire/release around markCompacted?

          For the record, this is because this make the logic in SSTR.markCompacted and SSTR.releaseReference easier. If caller of markCompacted don't acquire a reference, markCompacted will have to deal with two cases: either no thread have a reference acquired, in which case the current thread should schedule the deletion, or other thread have a reference in which case it should let them the task of scheduling the deletion where they are done. But making this thread safe (so that we don't schedule twice or forget to schedule the deletion if the last thread holding a reference release it at the same time as the markCompacted is called) is a bit of annoying. Acquire a reference when markCompacted is called make this easier and move all the logic in releaseReference.

          I believe currently, files are not deleted until the entire repair is finished

          The file should get deleted as soon as they are not useful anymore, that is as soon as they have been streamed. That being said, there was a bug, see below.

          Did another repair overnight, one minor compaction included some 20 small sstables, all of them remains as well as a few from other compactions and the files from the repairs described before

          Yes, I did find a place where we were not correctly decrementing the reference count for streaming (repair was not unmarking sstable that were not streamed because they had nothing for the range to transfer). Attached v4 patch should fix that.

          As for the last version of this patch, a quick look tonight shows access problems with markCurrentViewReferenced()

          v4 is based on v3 and fixes this (it reintroduce a specific method instead of making View public because I'm not too keen on doing that, but that can change if someone feels strongly about that).

          v4 also fix a bug in StreamingTransferTest and another one related to null segment in the unmmapping cleanup code.

          Show
          Sylvain Lebresne added a comment - why does DT.removeOldSSTableSize acquire/release around markCompacted? For the record, this is because this make the logic in SSTR.markCompacted and SSTR.releaseReference easier. If caller of markCompacted don't acquire a reference, markCompacted will have to deal with two cases: either no thread have a reference acquired, in which case the current thread should schedule the deletion, or other thread have a reference in which case it should let them the task of scheduling the deletion where they are done. But making this thread safe (so that we don't schedule twice or forget to schedule the deletion if the last thread holding a reference release it at the same time as the markCompacted is called) is a bit of annoying. Acquire a reference when markCompacted is called make this easier and move all the logic in releaseReference. I believe currently, files are not deleted until the entire repair is finished The file should get deleted as soon as they are not useful anymore, that is as soon as they have been streamed. That being said, there was a bug, see below. Did another repair overnight, one minor compaction included some 20 small sstables, all of them remains as well as a few from other compactions and the files from the repairs described before Yes, I did find a place where we were not correctly decrementing the reference count for streaming (repair was not unmarking sstable that were not streamed because they had nothing for the range to transfer). Attached v4 patch should fix that. As for the last version of this patch, a quick look tonight shows access problems with markCurrentViewReferenced() v4 is based on v3 and fixes this (it reintroduce a specific method instead of making View public because I'm not too keen on doing that, but that can change if someone feels strongly about that). v4 also fix a bug in StreamingTransferTest and another one related to null segment in the unmmapping cleanup code.
          Hide
          Jonathan Ellis added a comment -

          +1 from me in principle, suggest waiting for Terje's test before committing tho

          Show
          Jonathan Ellis added a comment - +1 from me in principle, suggest waiting for Terje's test before committing tho
          Hide
          Jonathan Ellis added a comment -

          (suggest adding a comment for the markCompacted acquire/release code on commit.)

          Show
          Jonathan Ellis added a comment - (suggest adding a comment for the markCompacted acquire/release code on commit.)
          Hide
          Terje Marthinussen added a comment -

          First impression is good.

          I merged this with 2816 last night and tested overnight with a dataset that normally uses 32-40% disk space when compacted.

          Calling repair on a handful of nodes, I would normally have a good chance of seeing a node or two with 90%+ full disk and regular full GC's.

          With these patches, disk use seem to peak in the 55-65% disk area which is a very significant improvement.

          However, unfortunately, I started repairs on 5 of 12 nodes and one of them has gone crazy however and filled the disk 100%.

          I did happen to start repair twice on this node by accident.
          I somewhat do not think that is the real problem however, but I am not sure.
          Will make a ticket anyway on adding functionality to prevent repairs from getting started twice on the same CF and I will test more to see if this happens again.

          However I noticed this in the log:
          INFO [Thread-185] 2011-06-28 05:01:15,390 StorageService.java (line 2083) requesting GC to free disk space
          I guess we can get rid of that?
          Should be nothing good from calling full GC there anymore?

          If anything, maybe we should consider a variation here where cassandra instead aborts all repairs and compactions and clean up everything before trying again with smaller amounts of data in each compaction/repair attempt?

          Another ticket on that maybe?

          Show
          Terje Marthinussen added a comment - First impression is good. I merged this with 2816 last night and tested overnight with a dataset that normally uses 32-40% disk space when compacted. Calling repair on a handful of nodes, I would normally have a good chance of seeing a node or two with 90%+ full disk and regular full GC's. With these patches, disk use seem to peak in the 55-65% disk area which is a very significant improvement. However, unfortunately, I started repairs on 5 of 12 nodes and one of them has gone crazy however and filled the disk 100%. I did happen to start repair twice on this node by accident. I somewhat do not think that is the real problem however, but I am not sure. Will make a ticket anyway on adding functionality to prevent repairs from getting started twice on the same CF and I will test more to see if this happens again. However I noticed this in the log: INFO [Thread-185] 2011-06-28 05:01:15,390 StorageService.java (line 2083) requesting GC to free disk space I guess we can get rid of that? Should be nothing good from calling full GC there anymore? If anything, maybe we should consider a variation here where cassandra instead aborts all repairs and compactions and clean up everything before trying again with smaller amounts of data in each compaction/repair attempt? Another ticket on that maybe?
          Hide
          Terje Marthinussen added a comment -

          Looking a bit more on files and logs from the node with full disk.

          I suspect things mainly went wrong due to 2 repairs (if they actually both ran)
          1. Seems there are some Compacted files around which does not get cleaned up, but I guess these may be results of references acquired which are not freed as the streaming fills up the disk and fails.
          2. there are no less but 53 tmp files. A lot of concurrent streams here!

          I still think it may be a good idea if we could detect that we were heading for full disk and abort everything before things crashed.

          Show
          Terje Marthinussen added a comment - Looking a bit more on files and logs from the node with full disk. I suspect things mainly went wrong due to 2 repairs (if they actually both ran) 1. Seems there are some Compacted files around which does not get cleaned up, but I guess these may be results of references acquired which are not freed as the streaming fills up the disk and fails. 2. there are no less but 53 tmp files. A lot of concurrent streams here! I still think it may be a good idea if we could detect that we were heading for full disk and abort everything before things crashed.
          Hide
          Jonathan Ellis added a comment -

          I still think it may be a good idea if we could detect that we were heading for full disk and abort everything before things crashed

          That's a good feature to have, but out of scope here.

          CASSANDRA-809 is open as a general "deal better with running out of space" ticket, but we can open more specific ones.

          Show
          Jonathan Ellis added a comment - I still think it may be a good idea if we could detect that we were heading for full disk and abort everything before things crashed That's a good feature to have, but out of scope here. CASSANDRA-809 is open as a general "deal better with running out of space" ticket, but we can open more specific ones.
          Hide
          Sylvain Lebresne added a comment -

          but I guess these may be results of references acquired which are not freed as the streaming fills up the disk and fails.

          Yes, until we have CASSANDRA-2433 (rebased with this), we don't detect failing streaming and thus never delete the files (until restart). Which btw make me say that we better have CASSANDRA-2433 if we have this ticket. But that was the plan anyway.

          there are no less but 53 tmp files. A lot of concurrent streams here!

          Though it is not related to this ticket, I'll note that CASSANDRA-2816 only stagger the merkle tree creation, not the streaming. That is, the streaming will be staggered to some extends, but if the streaming part is much longer than the merkle tree creation one, you will still have lots of concurrent stream going on. But -tmp files also includes the compaction that are going on, and failed repair leaves -tmp file around, which could also help explaining there large number. In any case, this is not related to the issue at hand

          However I noticed this in the log:
          INFO [Thread-185] 2011-06-28 05:01:15,390 StorageService.java (line 2083) requesting GC to free disk space
          I guess we can get rid of that?

          In some cases (mmap with non-sun jvm at least) we are still relying on the GC to free space.

          Terje, if you can confirm that you didn't saw something utterly wrong with the last patch (related to that patch, no repair), I'll commit it. I think having it in trunk quicker will help with having more testing quicker. And given that we don't want to have bugs in our "force unmapping" it'll be a good thing. In particular, could be good to have someone try that on windows.

          Show
          Sylvain Lebresne added a comment - but I guess these may be results of references acquired which are not freed as the streaming fills up the disk and fails. Yes, until we have CASSANDRA-2433 (rebased with this), we don't detect failing streaming and thus never delete the files (until restart). Which btw make me say that we better have CASSANDRA-2433 if we have this ticket. But that was the plan anyway. there are no less but 53 tmp files. A lot of concurrent streams here! Though it is not related to this ticket, I'll note that CASSANDRA-2816 only stagger the merkle tree creation, not the streaming. That is, the streaming will be staggered to some extends, but if the streaming part is much longer than the merkle tree creation one, you will still have lots of concurrent stream going on. But -tmp files also includes the compaction that are going on, and failed repair leaves -tmp file around, which could also help explaining there large number. In any case, this is not related to the issue at hand However I noticed this in the log: INFO [Thread-185] 2011-06-28 05:01:15,390 StorageService.java (line 2083) requesting GC to free disk space I guess we can get rid of that? In some cases (mmap with non-sun jvm at least) we are still relying on the GC to free space. Terje, if you can confirm that you didn't saw something utterly wrong with the last patch (related to that patch, no repair), I'll commit it. I think having it in trunk quicker will help with having more testing quicker. And given that we don't want to have bugs in our "force unmapping" it'll be a good thing. In particular, could be good to have someone try that on windows.
          Hide
          Terje Marthinussen added a comment -

          Latest iteration of this patch look great in my case, but please note again, I have not tested mmap.

          Maybe you can make the "full GC" trigger conditional if mmap is used?

          Only thing missing beyond that is to get this into 0.8.

          Disk use is very significantly improved in some cases and I think this could easily be the biggest improvement you can do in 0.8.2 in terms of operation friendliness.

          Regarding the above mentioned tmp tables, there is definitely issues there, but don't think it is related to this patch so feel free to merge.

          Show
          Terje Marthinussen added a comment - Latest iteration of this patch look great in my case, but please note again, I have not tested mmap. Maybe you can make the "full GC" trigger conditional if mmap is used? Only thing missing beyond that is to get this into 0.8. Disk use is very significantly improved in some cases and I think this could easily be the biggest improvement you can do in 0.8.2 in terms of operation friendliness. Regarding the above mentioned tmp tables, there is definitely issues there, but don't think it is related to this patch so feel free to merge.
          Hide
          Terje Marthinussen added a comment -

          I have not found any further major issues here, but I think there is still situations where files are deleted late but they do seem to go away.

          Not sure if we are missing something in terms of reference counting and GC delete it eventually or it is just a delayed free or delete for some reason, but it does not happen too often.

          Will see try to add some debug logging and see what I find.

          I was looking at the code though and I am wondering about one segment. I have not had time to actually test this, but in submitUserDefined() there is a finally statement removing References for "sstables" but I could not immediately see where there are References acquired for all the sstables that needs to be freed there?

          I am sure it's just me missing something, but anyway...

          Show
          Terje Marthinussen added a comment - I have not found any further major issues here, but I think there is still situations where files are deleted late but they do seem to go away. Not sure if we are missing something in terms of reference counting and GC delete it eventually or it is just a delayed free or delete for some reason, but it does not happen too often. Will see try to add some debug logging and see what I find. I was looking at the code though and I am wondering about one segment. I have not had time to actually test this, but in submitUserDefined() there is a finally statement removing References for "sstables" but I could not immediately see where there are References acquired for all the sstables that needs to be freed there? I am sure it's just me missing something, but anyway...
          Hide
          Sylvain Lebresne added a comment -

          Not sure if we are missing something in terms of reference counting and GC delete it eventually

          If you don't use mmap, the GC shouldn't do anything. So if it is deleted eventually, it would indicate some place where we last decrement is delayed longer that it needs somehow. It'd be interesting if you find more.

          in submitUserDefined() there is a finally statement removing References for "sstables" but I could not immediately see where there are References acquired

          It's in the lookupSSTables (a private method used by submitUserDefined). I admit it's not super clean but It felt like the simplest way to do this in a thread safe manner without holding unneeded references for too long.

          Only thing missing beyond that is to get this into 0.8.

          I really don't think this would be reasonable. This is not a trivial change by any mean, nor does it fixes a regression. Which is not saying it doesn't make life much easier. But I'm really uncomfortable pushing that to 0.8.

          Show
          Sylvain Lebresne added a comment - Not sure if we are missing something in terms of reference counting and GC delete it eventually If you don't use mmap, the GC shouldn't do anything. So if it is deleted eventually, it would indicate some place where we last decrement is delayed longer that it needs somehow. It'd be interesting if you find more. in submitUserDefined() there is a finally statement removing References for "sstables" but I could not immediately see where there are References acquired It's in the lookupSSTables (a private method used by submitUserDefined). I admit it's not super clean but It felt like the simplest way to do this in a thread safe manner without holding unneeded references for too long. Only thing missing beyond that is to get this into 0.8. I really don't think this would be reasonable. This is not a trivial change by any mean, nor does it fixes a regression. Which is not saying it doesn't make life much easier. But I'm really uncomfortable pushing that to 0.8.
          Hide
          Terje Marthinussen added a comment -

          In releaseReference(), if holdReferences for some reason gets less than 0, maybe the code should do an assert or throw an exception (or anything else that gives a stack trace)?

          Should help debugging for some error scenarios with reference mismatches.

          Show
          Terje Marthinussen added a comment - In releaseReference(), if holdReferences for some reason gets less than 0, maybe the code should do an assert or throw an exception (or anything else that gives a stack trace)? Should help debugging for some error scenarios with reference mismatches.
          Hide
          Terje Marthinussen added a comment -

          I found an error in how the patch had gotten merged in our code.
          After fixing that, I see no Compacted files around that I cannot explain.

          This found the the error:

              public void releaseReference()
              {
                  if (holdReferences.decrementAndGet() == 0 && isCompacted.get())
                  {
                      // Force finalizing mmapping if necessary
                      ifile.cleanup();
                      dfile.cleanup();
          
                      deletingTask.schedule();
                  }
                  assert holdReferences.get() >= 0 : "Reference counter " +  holdReferences.get() + " for " + dfile.path;
           
              }
          

          that assert turned quite useful so a recommended addon.

          Been doing a lot of updating, repairing, querying since then on a few system here ranging from 3 nodes with 100-200 million docs to 12 nodes with more than a billion. Nothing abnormal so far. Things that made us run out of disk before or sent nodes into full GC land, now works.

          I would still like to see that full GC is not called if disk runs past 90% usage as and mmap is not used. No point halting the world if not needed?

          Show
          Terje Marthinussen added a comment - I found an error in how the patch had gotten merged in our code. After fixing that, I see no Compacted files around that I cannot explain. This found the the error: public void releaseReference() { if (holdReferences.decrementAndGet() == 0 && isCompacted.get()) { // Force finalizing mmapping if necessary ifile.cleanup(); dfile.cleanup(); deletingTask.schedule(); } assert holdReferences.get() >= 0 : "Reference counter " + holdReferences.get() + " for " + dfile.path; } that assert turned quite useful so a recommended addon. Been doing a lot of updating, repairing, querying since then on a few system here ranging from 3 nodes with 100-200 million docs to 12 nodes with more than a billion. Nothing abnormal so far. Things that made us run out of disk before or sent nodes into full GC land, now works. I would still like to see that full GC is not called if disk runs past 90% usage as and mmap is not used. No point halting the world if not needed?
          Hide
          Jeff Kesselman added a comment -

          Just a note: while relying on phantom references or finalizer for your cleanup is incorrect, it is reasonable to use them to do a verification that your cleanup has occurred on those objects that do get collected. This is with the caveat however that PhantomReferences always add something to your gc costs so you might want to make it switchable.

          Show
          Jeff Kesselman added a comment - Just a note: while relying on phantom references or finalizer for your cleanup is incorrect, it is reasonable to use them to do a verification that your cleanup has occurred on those objects that do get collected. This is with the caveat however that PhantomReferences always add something to your gc costs so you might want to make it switchable.
          Hide
          Sylvain Lebresne added a comment -

          It turns out that this patch triggers CASSANDRA-2872 during the unit tests. So let's maybe fix this first before committing to avoids having know failing unit tests.

          Show
          Sylvain Lebresne added a comment - It turns out that this patch triggers CASSANDRA-2872 during the unit tests. So let's maybe fix this first before committing to avoids having know failing unit tests.
          Hide
          Sylvain Lebresne added a comment -

          Attaching v5. It is rebased and now all unit tests are passing. It also adds a few requested comments and only triggers GC for disk space when that could possibly be useful (i.e, when mmap is used on a non sun jvm (more precisely, if the mmap "cleaner" is not available)).

          Will commmit this based on earlier +1 and on Terje successful testing (thanks a lot for that btw).

          Show
          Sylvain Lebresne added a comment - Attaching v5. It is rebased and now all unit tests are passing. It also adds a few requested comments and only triggers GC for disk space when that could possibly be useful (i.e, when mmap is used on a non sun jvm (more precisely, if the mmap "cleaner" is not available)). Will commmit this based on earlier +1 and on Terje successful testing (thanks a lot for that btw).
          Hide
          Sylvain Lebresne added a comment -

          Committed, thanks all.

          Show
          Sylvain Lebresne added a comment - Committed, thanks all.
          Hide
          Hudson added a comment -

          Integrated in Cassandra #967 (See https://builds.apache.org/job/Cassandra/967/)
          Use reference counting to delete sstables instead of relying on the GC
          patch by slebresne; reviewed by jbellis for CASSANDRA-2521

          slebresne : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1149085
          Files :

          • /cassandra/trunk/test/unit/org/apache/cassandra/io/sstable/SSTableUtils.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java
          • /cassandra/trunk/src/java/org/apache/cassandra/config/DatabaseDescriptor.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/DataTracker.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/Table.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/CompactionManager.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
          • /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamOutSession.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/streaming/SerializationsTest.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageService.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/util/BufferedSegmentedFile.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/util/SegmentedFile.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/util/MmappedSegmentedFile.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableDeletingTask.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageServiceMBean.java
          • /cassandra/trunk/CHANGES.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/streaming/PendingFile.java
          • /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamInSession.java
          • /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamOut.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/GCInspector.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/streaming/StreamingTransferTest.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableReader.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableDeletingReference.java
          Show
          Hudson added a comment - Integrated in Cassandra #967 (See https://builds.apache.org/job/Cassandra/967/ ) Use reference counting to delete sstables instead of relying on the GC patch by slebresne; reviewed by jbellis for CASSANDRA-2521 slebresne : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1149085 Files : /cassandra/trunk/test/unit/org/apache/cassandra/io/sstable/SSTableUtils.java /cassandra/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java /cassandra/trunk/src/java/org/apache/cassandra/config/DatabaseDescriptor.java /cassandra/trunk/src/java/org/apache/cassandra/db/DataTracker.java /cassandra/trunk/src/java/org/apache/cassandra/db/Table.java /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/CompactionManager.java /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamOutSession.java /cassandra/trunk/test/unit/org/apache/cassandra/streaming/SerializationsTest.java /cassandra/trunk/src/java/org/apache/cassandra/service/StorageService.java /cassandra/trunk/src/java/org/apache/cassandra/io/util/BufferedSegmentedFile.java /cassandra/trunk/src/java/org/apache/cassandra/io/util/SegmentedFile.java /cassandra/trunk/src/java/org/apache/cassandra/io/util/MmappedSegmentedFile.java /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableDeletingTask.java /cassandra/trunk/src/java/org/apache/cassandra/service/StorageServiceMBean.java /cassandra/trunk/CHANGES.txt /cassandra/trunk/src/java/org/apache/cassandra/streaming/PendingFile.java /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamInSession.java /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamOut.java /cassandra/trunk/src/java/org/apache/cassandra/service/GCInspector.java /cassandra/trunk/test/unit/org/apache/cassandra/streaming/StreamingTransferTest.java /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableReader.java /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableDeletingReference.java

            People

            • Assignee:
              Sylvain Lebresne
              Reporter:
              Chris Goffinet
              Reviewer:
              Jonathan Ellis
            • Votes:
              14 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development