Cassandra / CASSANDRA-7386

JBOD threshold to prevent unbalanced disk utilization

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.0.12, 2.1.3
    • Component/s: None
    • Labels:
      None

      Description

      Currently, disks are picked first by number of current tasks, then by free space. This helps with performance but can lead to large differences in utilization in some (unlikely but possible) scenarios. I've seen 55% vs. 10% and heard reports of 90% vs. 10% on IRC, with both LCS and STCS (although my suspicion is that STCS makes it worse, since it is harder to keep balanced).

      I propose changing the algorithm a little to have some maximum range of utilization beyond which it picks by free space over load (acknowledging it can be slower). So if disk A is 30% full and disk B is 5% full, it will never pick A over B until utilization balances out. A sketch of the idea follows.
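
      As a rough illustration of that rule, something like the sketch below, where the DataDir type, its methods, and the 25% gap are hypothetical stand-ins, not the actual Directories code:

      import java.util.Comparator;
      import java.util.List;

      final class ThresholdPickSketch
      {
          interface DataDir
          {
              int currentTasks();   // tasks currently writing to this disk
              double usedRatio();   // usedBytes / totalBytes
          }

          // maximum allowed utilization spread before free space overrides task count
          static final double MAX_UTILIZATION_GAP = 0.25;

          static DataDir pick(List<DataDir> dirs) // assumes dirs is non-empty
          {
              DataDir leastBusy = dirs.stream().min(Comparator.comparingInt(DataDir::currentTasks)).get();
              DataDir leastUsed = dirs.stream().min(Comparator.comparingDouble(DataDir::usedRatio)).get();
              // keep the performance-oriented choice (fewest current tasks) unless it is
              // much fuller than the emptiest disk; then pick by free space instead
              return (leastBusy.usedRatio() - leastUsed.usedRatio() > MAX_UTILIZATION_GAP) ? leastUsed : leastBusy;
          }
      }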

      Attachments

      1. 7386-2.0-v3.txt
        16 kB
        Robert Stupp
      2. 7386-2.0-v4.txt
        23 kB
        Robert Stupp
      3. 7386-2.0-v5.txt
        148 kB
        Robert Stupp
      4. 7386-2.1-v3.txt
        18 kB
        Robert Stupp
      5. 7386-2.1-v4.txt
        26 kB
        Robert Stupp
      6. 7386-2.1-v5.txt
        26 kB
        Robert Stupp
      7. 7386-v1.patch
        15 kB
        Robert Stupp
      8. 7386v2.diff
        23 kB
        Robert Stupp
      9. Mappe1.ods
        62 kB
        Robert Stupp
      10. mean-writevalue-7disks.png
        48 kB
        Lyuben Todorov
      11. patch_2_1_branch_proto.diff
        10 kB
        Chris Lohfink
      12. sstable-count-second-run.png
        30 kB
        Lyuben Todorov
      13. test_regression_no_patch.jpg
        84 kB
        Alan Boudreault
      14. test_regression_with_patch.jpg
        94 kB
        Alan Boudreault
      15. test1_no_patch.jpg
        53 kB
        Alan Boudreault
      16. test1_with_patch.jpg
        59 kB
        Alan Boudreault
      17. test2_no_patch.jpg
        65 kB
        Alan Boudreault
      18. test2_with_patch.jpg
        69 kB
        Alan Boudreault
      19. test3_no_patch.jpg
        59 kB
        Alan Boudreault
      20. test3_with_patch.jpg
        73 kB
        Alan Boudreault

        Issue Links

          Activity

          cnlwsu Chris Lohfink added a comment -

           Attached a prototype to see if that approach is acceptable. It will need a little cleanup (make the threshold a config value or at least a constant).

          jbellis Jonathan Ellis added a comment -

           Once it's full enough, C* will start using the less-full disk since there is no room on the other. Until then, maybe the less-full disk will get balanced better out of randomness. Forcing this behavior early just turns the possibility of degraded performance later into a certainty of degraded performance now.

          Am I missing something?

          cnlwsu Chris Lohfink added a comment -

           Where I have seen it become a problem is more on the high-read-throughput side. A lot of reads plus compactions (repairs) end up utilizing one disk pretty heavily, since it contains most of the data, while the other disk sits comparatively idle in iostat.

           That said, it's an awful lot of maybes and statistically unexpected scenarios. This may not be worth adding complexity for the bottom percentiles, so feel free to mark as won't-fix if it isn't worth it. The idea was more to add some threshold so that the likely performance hit is only taken to prevent the extreme cases. The 10% range difference in the attached prototype implementation is probably a bad example; I would see it more around 25-30%.

           An offline tool to balance things out after the fact may be an adequate solution that can be developed outside of C* as well.

          snazy Robert Stupp added a comment -

           Jonathan Ellis yep - maybe (if I understand correctly what Directories.DataDirectories.currentTasks means). currentTasks just reflects the number of concurrently running tasks - a very volatile value that is unlikely to reflect the real load. This value is used for the disk comparator. IMO it is better to track the load of the disk (similar to Unix system load) and not the currently active tasks. Load calculation could take place at 5-second intervals using the sum of all tasks. I have a utility class for configurable load calculation - I can provide it. Adding load tracking to Directories is a trivial change - it just needs to be called every 1, 2, 3 or 5 seconds to update the load of the directory. A sketch of what such a load metric could look like is below.
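
           For illustration only, a Unix-load-style per-directory tracker along those lines might look like the sketch below; the class name and sampling window are assumptions, not part of any attached patch.

           // Hypothetical sketch of per-directory load tracking, sampled on a fixed 5-second schedule.
           public final class DirectoryLoad
           {
               private static final double SAMPLE_SECONDS = 5.0;
               private static final double WINDOW_SECONDS = 60.0;
               // exponential decay factor, analogous to the 1-minute Unix load average
               private static final double ALPHA = Math.exp(-SAMPLE_SECONDS / WINDOW_SECONDS);

               private volatile double load;

               // called every 5 seconds with the number of tasks active on this directory in that interval
               public synchronized void update(int activeTasks)
               {
                   load = load * ALPHA + activeTasks * (1.0 - ALPHA);
               }

               public double load()
               {
                   return load;
               }
           }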

          jbellis Jonathan Ellis added a comment -

          IMO it is better to track the load of the disk (similar to Unix system load) and not the currently active tasks.

          That's a good idea. In fact, we already track reads-per-sstable, so I don't think we'd need to add any extra metrics to do this.

          snazy Robert Stupp added a comment -

          we already track reads-per-sstable

          So shall we add load calculation to the existing metric?

          jbellis Jonathan Ellis added a comment -

          Yes. Do you want to give that a try?

          snazy Robert Stupp added a comment -

           Yep - I'll start coding this evening or tomorrow.

          snazy Robert Stupp added a comment -

           Here's an initial draft version that collects metrics on data directories and chooses the directory to write to based on a value calculated from current task load and free-disk ratio. It's not finished yet because I still have to think about how to weight load and free ratio - and maybe the "write size", too.

          snazy Robert Stupp added a comment -

          Here's a working version of the patch.

          It adds new metrics to each data directory:

          • readTasks counts the read requests
          • writeTasks counts the write requests
          • writeValue* exposes the "write value" for each data directory for mean, one/five/fifteen minutes

          The data directory with the highest "write value" is chosen for new sstables.

          "Write value" is calculated using the formula:
          freeRatio / weightedRate where freeRatio = availableBytes / totalBytes and weightedRate = writeRate + readRate / 2. "divide by 2" has been randomly chosen since not every read operation hits the disks.
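
           For illustration, that formula in plain Java would be roughly the following (a hypothetical helper, not code from the patch):

           // Higher values mean "more free space and less I/O", i.e. a more attractive target directory.
           static double writeValue(long availableBytes, long totalBytes, double writeRate, double readRate)
           {
               double freeRatio = (double) availableBytes / totalBytes;
               // halve the read rate because not every read operation actually hits the disk
               double weightedRate = writeRate + readRate / 2.0;
               // an idle directory (rate 0) is maximally attractive
               return weightedRate == 0.0 ? Double.POSITIVE_INFINITY : freeRatio / weightedRate;
           }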

           readRate is taken from SSTableReader.incrementReadCount(), but I had to add a call to incrementReadCount() in a few places in the code. I did not add it to RandomAccessReader or SegmentedFile because this patch should not influence performance too much.

           I did not experiment much with the formula, but I created a spreadsheet (Mappe1.ods) that shows the "write value" in a matrix of freeRatio vs. weightedRate.

           I've run cassandra-stress against a (single-node, single-data-directory) C* instance and saw that the writeValue behaves as expected.

           But that's only half the battle. The patch has to be verified in a "real, production-like cluster": "write value" needs to be compared with iostat, df, etc. Is there any possibility to do that?

          jbellis Jonathan Ellis added a comment -

          Yuki Morishita to review

          yukim Yuki Morishita added a comment -

           Robert Stupp Can we just calculate `weightedRate = writeRate + readRate`? readRate is incremented through SSTableReader#incrementReadCount, which is only invoked when actually hitting disk. Otherwise, it is good.

           I will find JBOD-configured boxes to see how this performs.

          snazy Robert Stupp added a comment -

           Can we just calculate `weightedRate = writeRate + readRate`?

           Of course - but I'm pretty sure the math will need some adjustments based on empirically determined test results, since the point of this ticket is to find a disk that is not "hammered" with current disk I/O and has enough free disk space.

          snazy Robert Stupp added a comment -

          ping - just want to know how the JBOD boxes performed

          jbellis Jonathan Ellis added a comment -

          Yuki Morishita / Robert Stupp, if you specify the tests you'd like to run I can have Lyuben Todorov run them on one of our 8-disk boxes.

          snazy Robert Stupp added a comment -

          MBeans give information about the "write value" for each data directory. These should be monitored.
          These tests should be performed with and without the patch.

          One test is similar to what's reported in CASSANDRA-7615:

          1. start with half of the disks configured for C* data directories
          2. add data (using stress tool?)
          3. after some time, add more data directories
          4. add more data (using a different keyspace, see below)
          5. new sstables should prefer the new data directories (they should have a better "write value")
           6. at some point, new sstables should be distributed equally over all disks and result in approximately the same utilization

          Next test is to check that heavily utilized (read or write ops) directories are not chosen for new sstables.

          1. "Hammer" the first directories (the first keyspace) from the previous test with compactions or repairs
          2. Add more data to a new keyspace
          3. New sstables should go to the 2nd set of data directories.
          yukim Yuki Morishita added a comment - - edited

           Lyuben Todorov I have a branch that writes WriteValues to CSV for analysis. Feel free to use it.

          lyubent Lyuben Todorov added a comment - - edited

           After loading 750 GB of data into a 7-disk cluster, the test shows what we expected: the drives that had data written to them previously do indeed have lower write values, and these values increase as the rest of the drives begin filling up (view graph).

           I also tracked sstable creation and compaction. To sum it up, on the first run only 2 drives were used; on the second run another five were added. At the end of the first run, d1 and d2 (disk1 and disk2 respectively) were saturated with sstables: d1 had 357 and d2 had 318. When the second run was started, another ks, second_run, was created and an additional 5 disks were used in the node. A majority of sstables were sent to the five new directories, as expected (view piechart).

           The second test, to see if busy disks are used for new tables, is coming up.

          jbellis Jonathan Ellis added a comment -

          I'm not sure what I'm looking at in the first graph, but "on the first run only 2 drives were used" doesn't sound like it's working very well.

          snazy Robert Stupp added a comment -

           I guess what Lyuben Todorov is saying is that he started with C* configured to use 2 disks for data directories, then added data to it.
           After that, he restarted C* with 5 more drives (7 in total now) and resumed adding data to the node.
           The 5 new drives were used for new sstables (because they have a higher "write value", since these 5 new drives have more disk space left).
           So it seems to work as expected.

           I'm looking forward to seeing the results with busy disks, when I/O read/write usage comes into play.

          jbellis Jonathan Ellis added a comment -

          Ah, that makes more sense.

          lyubent Lyuben Todorov added a comment -

           For clarity's sake I'll add the script that led to the data used for the above graphs. P.S. /cc Robert Stupp exactly what I was trying to say. Started with 2 drives to see if adding another 5 later would lead to higher write values in the newly added drives (all 5 of them ended up having higher write values).

          snazy Robert Stupp added a comment -

          just a ping

          snazy Robert Stupp added a comment -

          The patch should still apply w/o conflicts.

          jbellis Jonathan Ellis added a comment -

           The data directory with the highest "write value" is chosen for new sstables. "Write value" is calculated using the formula freeRatio / weightedRate

          That doesn't sound right. So the busier the drive is, the more likely we'll add more sstables to it? We want to put new sstables on un-busy drives so they hopefully become busier.

          jblangston@datastax.com J.B. Langston added a comment - - edited

           I've seen a lot of users hitting this issue lately, so the sooner we can get a patch the better. This also needs to be backported to 2.0 if at all possible. In several cases I've seen severe imbalances like the ones described, where some drives are completely full and others are at 10-20% utilization.

          Here are a couple of stack traces. It happens both during flushes and compactions.

          ERROR [FlushWriter:6241] 2014-09-07 08:27:35,298 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:6241,5,main]
          FSWriteError in /data6/system/compactions_in_progress/system-compactions_in_progress-tmp-jb-8222-Index.db
          	at org.apache.cassandra.io.util.SequentialWriter.flushData(SequentialWriter.java:267)
          	at org.apache.cassandra.io.util.SequentialWriter.flushInternal(SequentialWriter.java:219)
          	at org.apache.cassandra.io.util.SequentialWriter.syncInternal(SequentialWriter.java:191)
          	at org.apache.cassandra.io.util.SequentialWriter.close(SequentialWriter.java:381)
          	at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:481)
          	at org.apache.cassandra.io.util.FileUtils.closeQuietly(FileUtils.java:212)
          	at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:301)
          	at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:417)
          	at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
          	at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:744)
          Caused by: java.io.IOException: No space left on device
          	at java.io.RandomAccessFile.writeBytes0(Native Method)
          	at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java:520)
          	at java.io.RandomAccessFile.write(RandomAccessFile.java:550)
          	at org.apache.cassandra.io.util.SequentialWriter.flushData(SequentialWriter.java:263)
          	... 13 more
          
          ERROR [CompactionExecutor:9166] 2014-09-06 16:09:14,786 CassandraDaemon.java (line 198) Exception in thread Thread[CompactionExecutor:9166,1,main]
          FSWriteError in /data6/keyspace_1/data/keyspace_1-data-tmp-jb-13599-Filter.db
          	at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:475)
          	at org.apache.cassandra.io.util.FileUtils.closeQuietly(FileUtils.java:212)
          	at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:301)
          	at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:209)
          	at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          	at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
          	at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
          	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:197)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:744)
          Caused by: java.io.IOException: No space left on device
          	at java.io.FileOutputStream.write(Native Method)
          	at java.io.FileOutputStream.write(FileOutputStream.java:295)
          	at java.io.DataOutputStream.writeInt(DataOutputStream.java:197)
          	at org.apache.cassandra.utils.BloomFilterSerializer.serialize(BloomFilterSerializer.java:34)
          	at org.apache.cassandra.utils.Murmur3BloomFilter$Murmur3BloomFilterSerializer.serialize(Murmur3BloomFilter.java:44)
          	at org.apache.cassandra.utils.FilterFactory.serialize(FilterFactory.java:41)
          	at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:468)
          	... 13 more
          
          snazy Robert Stupp added a comment -

           Not really.
           It basically calculates the "value" of a drive using free-space percentage divided by drive-usage rate (the rate is basically the current number of read+write requests)

          which is intended to mean that

          • drives with more free space have a higher chance to get new sstables
          • drives with low access rate have a higher chance to get new sstables

          or vice versa

          • "hammered" drives are unlikely to get new data
          • full drives are unlikely to get new data
          jbellis Jonathan Ellis added a comment -

          We have three related problems. One is that the existing design is bad at balancing space used (and as posted initially, I'm okay with this to a degree), a more serious one is that post-CASSANDRA-5605 we can actually run out of space because of this, and finally space-used is actually the wrong thing to optimize for balancing.

          Starting with the last: ultimately, we want to optimize for balanced reads across the disks with enough space. We shouldn't include writes in the metrics because writes are transient. But trying to balance based on target disk readMeter is probably no more useful than disk space; we would need to take hotness of the source sstables into consideration as well, and compact cold sstables to disks with high activity and hot ones to disks with low activity. This is outside the scope of this ticket.

          So, if balancing by disk space is the best we can do, here is an optimal approach:

          1. Compute the total free disk space T as the sum of each disk's free space D
          2. For each disk, assign it D/T of the new sstables. (Weighted random may be easiest.)

           To ensure we never accidentally assign an sstable to a disk that doesn't have room for it, we should also estimate the space to be used and restrict our candidates to disks that have room for it. Basically, revert CASSANDRA-5605. But we don't want to go back to the bad old days of being too pessimistic, so our fallback is: if no disk has space for the worst-case estimate, pick the disk with the most free space.
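
           A minimal sketch of that approach (candidate filtering, D/T weighted random, most-free-space fallback); the class and method names here are illustrative, not the actual Directories API:

           import java.io.File;
           import java.util.Comparator;
           import java.util.List;
           import java.util.concurrent.ThreadLocalRandom;
           import java.util.stream.Collectors;

           final class FreeSpacePicker
           {
               // Pick a data directory for a new sstable of roughly estimatedSize bytes.
               static File pick(List<File> dataDirs, long estimatedSize)
               {
                   // restrict candidates to disks that have room for the worst-case estimate
                   List<File> candidates = dataDirs.stream()
                                                   .filter(d -> d.getUsableSpace() > estimatedSize)
                                                   .collect(Collectors.toList());

                   // fallback: nothing fits the estimate, so take the disk with the most free space
                   if (candidates.isEmpty())
                       return dataDirs.stream()
                                      .max(Comparator.comparingLong(File::getUsableSpace))
                                      .orElse(null);

                   // weighted random: each disk gets D/T of the new sstables,
                   // where D is its free space and T is the total free space
                   long total = candidates.stream().mapToLong(File::getUsableSpace).sum();
                   long r = ThreadLocalRandom.current().nextLong(total);
                   for (File d : candidates)
                   {
                       r -= d.getUsableSpace();
                       if (r < 0)
                           return d;
                   }
                   return candidates.get(candidates.size() - 1);
               }
           }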

          snazy Robert Stupp added a comment -

          Thanks for the explanation.
           The root cause of CASSANDRA-5605 (as far as I understand) was that the sum of "space reserved for compactions" + "required space" did not fit on any disk.

           I'm fine with ignoring write-rate and read-rate - that seems reasonable now. Any assumption about the "load" of the disks would depend on the usage of the data (which we cannot foresee ... yet).
           Moving to something that takes the "hotness" of the sstables into account is the way to go - but it's not easy. It requires a huge number of metric instances, and it is not easy to combine these metrics into a "forecast" for compactions.

           Yeah - basically only the current disk space seems to be usable for this ticket. Will make it like that.

          snazy Robert Stupp added a comment -

           IMO that simple "choose disk by free disk space" algorithm is manageable for 2.0, 2.1 and trunk.
           Going to implement it over the weekend.

          snazy Robert Stupp added a comment -

          Attached patches for 2.0 and 2.1.
          Uses weighted random to choose writable directories.
           Also added a parameter writeSize to getWriteableLocation to exclude directories that have insufficient free disk space.

           Although DataDirectory currently only has one field, location, I did not remove that class, since we can easily improve blacklisted-directory handling by adding AtomicBoolean instances for read+write blacklist marks (CASSANDRA-8324).

          snazy Robert Stupp added a comment -

           Yuki Morishita, can you review v3?

          aboudreault Alan Boudreault added a comment -

           Lyuben Todorov I am currently testing the patch. What metrics did you use to generate your graphs (per disk)? Thanks

          lyubent Lyuben Todorov added a comment -

           Alan Boudreault It's WriteValueMean (wvm) that I used. The graph shows the wvm drop for the five disks that were un-saturated (they have a higher wvm to begin with during the second run, as they are chosen more frequently over the two drives that were previously saturated).

          snazy Robert Stupp added a comment -

           Alan Boudreault the current patch has nothing like a "write value", so there are no metrics on it. The "metric" is just 'df -a'.
           That means: it computes the sum of free space over all drives [T] and the ratio of each disk's free space [D] using [D/T]. The actual disk is chosen by weighted random (which just means that drives with more free space have a higher chance to get a file).

          snazy Robert Stupp added a comment -

           Hm - just noticed that o.a.c.io.util.DiskAwareRunnable#runMayThrow and o.a.c.streaming.StreamReader#createWriter handle a returned null value if there's no directory with enough free disk space. But o.a.c.db.Directories#getLocationForDisk, used by getWriteableLocationAsFile, would NPE (this did not occur since a null argument never occurred).
           The old implementation would always return a directory, but this patch throws an IOError (an IOException in the code mentioned above).
           The loop in o.a.c.io.util.DiskAwareRunnable#runMayThrow seems useless.
           Clean it up in this patch?

          yukim Yuki Morishita added a comment -

          Clean it up in this patch?

          Yeah, the thing there is odd, so please.

          jjordan Jeremiah Jordan added a comment -

          This never gets called with a value besides -1, so why does it even take a parameter?

          getWriteableLocationAsFile(-1L);
          
          snazy Robert Stupp added a comment -

          getWriteableLocationAsFile never gets called with a value besides -1

           Yeah - just wanted that method to be consistent with the getWriteableLocation method.

          aboudreault Alan Boudreault added a comment - - edited

           Devs, I've tested this issue with and without the patch and analysed the disk usage in 3 scenarios. The patch works well and fixes important issues related to multiple directories. I'm sharing the results with the graphs:

           For all my tests, I have been able to reproduce the issues using multiple directories. There is no need to hammer the node with compaction and repair; I simply limited concurrent_compactors and compaction_throughput_mb_per_sec to slow things down. This keeps the disks busy during disk selection.

          Test 1

          • 2 Disks of the same size
          • Goal: stress the server to fill all disks
          Result - No Patch

           Only one disk is filled and the other one is never filled. Cassandra-stress crashed with a WriteTimeoutException while the second disk remained at ~20% disk usage.

          test1_no_patch.jpg

          Result - With Patch

           Success. Both disks are filled at approximately the same speed.

          test1_with_patch.jpg

          Test 2

          • 5 disks total of the same size
          • 2 disks initially filled at ~20%
          • 3 disks added later
          • Goal: stress the server to fill all disks
          Result - No Patch
           • The first 2 disks aren't used at the beginning since they are already at 20% of disk usage. (That's OK.)
           • Some new data is written.
           • 2 newly added disks are used for the initial data; when they reach 20% of disk usage, all 4 disks are filled at approximately the same speed.
           • The last disk, which is running a compaction, is almost never used and remains at 15% of disk usage when cassandra-stress crashes with write timeouts.

          test2_no_patch.jpg

          Result - With Patch

           Success. All disks have been filled at approximately the same speed. I notice that Cassandra doesn't wait until all 3 newly added disks are at 20% to re-use disks 1 and 2, but it keeps things OK and reduces the difference over the run.

          test2_with_patch.jpg

          Test 3

          • 5 disks total.
          • 4 disks of 2G of size
          • 1 disk of 10G of size (5x more than the other ones)
          • Goal: stress the server to fill all disks
          Result - No Patch
           • Disk #5 (10G of size) is initially used, then an internal compaction is started.
           • All 4 other disks are completely filled and disk 5 is never used anymore. Cassandra-stress crashes with write timeouts and disk 5 remains at 15% of disk usage with more than 8G of free space.

          test3_no_patch.jpg

          Result - With Patch

           Success. All 5 disks are filled at approximately the same speed.

          test3_with_patch.jpg

          snazy Robert Stupp added a comment -

          Alan Boudreault Cool! Thanks for testing it!

          snazy Robert Stupp added a comment -

          Patch v4 comments:

           • Made Directories.getWriteableLocation return null if no writable directory is present (to let the calling code react to that).
           • Let all callers pass a valid writeSize, which means near-full directories cannot be returned as targets for large compactions, streams, etc.
           • Removed StorageService.requestGC, which was referenced in Directories.getDirectoryForNewSSTables but never called (since getWriteableLocation never returned null).

           Yuki Morishita can you review v4?

          yukim Yuki Morishita added a comment -

          LGTM, except in the test code:

          // at least (rule of thumb) 100 iterations
          if (i >= 100)
              break;
          // random weighted writeable directory algorithm fails to return all possible directories after
          // many tries
          if (i >= 10000000)
              fail();
          

           I think you meant to check the second `if` first?

          snazy Robert Stupp added a comment -

           Patch v5 fixes the JUnit test (the only change).

           Yeah - that was weird. The intention was to run at least 100 iterations (to see whether it goes wrong) and at most "some more" iterations to give it a very high chance to succeed.

          yukim Yuki Morishita added a comment -

          Committed v5, thanks!

          aboudreault Alan Boudreault added a comment -

           Robert Stupp Yuki Morishita, while doing tests for CASSANDRA-8329, I've noticed an important regression related to this patch. I'll give you more info and graphs ASAP tomorrow morning.

          aboudreault Alan Boudreault added a comment - - edited

          Devs, this is the result of my regression test without and with the patch.

          Note: the compaction concurrency is set to 4 and the throughput unlimited.

          Test

          • 12 disks total of 2G of size.
          • Goal: run the following command to fill the disks:
            cassandra-stress WRITE n=2000000 -col size=FIXED(1000) -mode native prepared cql3 -schema keyspace=r1
          Result - No Patch

          test_regression_no_patch.jpg

           All disks are filled in ~6 minutes. Cassandra-stress crashed with write timeouts at around n=650000.

          Result - With Patch

          test_regression_with_patch.jpg

          Cassandra-stress finished all its work (~13 minutes, n=2000000) and all disks are under 60% of disk usage.

          Any idea what's going on? Am I doing something wrong in my test case?

          snazy Robert Stupp added a comment -

           Alan Boudreault I guess you mean disks 2 + 7 in test_regression_with_patch.jpg.
           I assume that disk2 got a lot of new sstables shortly after second 221 and that disk7 got another bunch of sstables just before second 701 - just because these were the "foolish" disks that were nearly empty.
           It might be a consequence of "unlimited compaction throughput" - can you verify that with a "conservative" compaction throughput?
           Maybe we have to (re)introduce reservation of disk space - it's not a big deal to implement that and provide a patch this evening (CET) so that you can verify it.

          aboudreault Alan Boudreault added a comment - - edited

           Robert Stupp In fact, my concern is not really the 2 full disks... but more why I can fill all my disks in 6 minutes without the patch, while with the patch 7/9 of my disks are under 60% of usage after 15 minutes. I might be wrong since this stuff is new to me... but is there some better compaction/compression happening with your patch, or was there something wrong happening before? Thanks!

           Yes, I will try with a conservative compaction throughput, like 16 MB/s (the default).

          snazy Robert Stupp added a comment -

           Hm ... I see. Have you seen any error in system.log during the no-patch run (besides "disk full")? Or any unfinished compactions? IMO the "with-patch" graph shows typical "compaction spikes" - but the "no-patch" graph doesn't.
           The patch itself has no direct influence on compactions - but since disk assignment is influenced by the patch, it has some influence.

          jjordan Jeremiah Jordan added a comment -

           Alan Boudreault the fact that you crash without the patch is exactly what this issue is trying to fix. So it is a GOOD thing that it happens without the patch, and that the patch fixes it.

           Robert Stupp nothing to change here, and no, we do not want to bring back disk reservation; that only caused problems.

          aboudreault Alan Boudreault added a comment -

           Yep, I discussed it with Jeremiah on HipChat and he clarified things. Thanks! Closing.

          snazy Robert Stupp added a comment -

          we do not want to bring back disk reservation

          good to hear

           Jeremiah Jordan just for me to understand it - what were the problems?

          aboudreault Alan Boudreault added a comment - - edited

           Robert Stupp From what I understand, the whole compaction process crashed as soon as it hit 1 full disk, so no more compaction was happening then. This makes sense, since in my prior tests I just made the compaction process very, very slow, so nothing was crashing. Jeremiah Jordan can confirm if I'm right here.

           Yuki Morishita Will this be backported to the cassandra-2.0 branch? Thanks

          jjordan Jeremiah Jordan added a comment -

          what were there problems?

           The problem was that without the patch, the test hit the issue the patch was meant to fix... one disk filling up completely and crashing things.

          snazy Robert Stupp added a comment -

          one disk filling up completely and crashing things

          just because of reservation? oops

          jjordan Jeremiah Jordan added a comment -

          just because of reservation? oops

           Oh sorry, I misunderstood you. The problem with reservations was CASSANDRA-5605. We would end up reserving the whole disk, so flushing couldn't happen, then the heap would fill up, and you would OOM. We reserved the maximum possible space, but for workloads with overwrites the resulting file is way smaller than the reservation, so we didn't actually need all that space. Basically, we were declaring "disk full" before the disk was actually full. The other problem is that when reserving space, if multiple compactions are in progress, you reserve the maximum needed by all of them. But they finish at different times, and when they finish, all the old files get removed. So again, you are declaring "disk full" when it will not actually be full.

          snazy Robert Stupp added a comment -

          Thanks

          yukim Yuki Morishita added a comment -

          Backported to 2.0.12 as well. (8b5cf64043e2d002fdb91921319110911e332042)


            People

             • Assignee: snazy Robert Stupp
             • Reporter: cnlwsu Chris Lohfink
             • Reviewer: Yuki Morishita
             • Tester: Alan Boudreault
             • Votes: 5
             • Watchers: 21
