  1. Cassandra
  2. CASSANDRA-4781

Sometimes Cassandra starts compacting the system-schema_columns cf repeatedly until the node is killed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 2
    • Component/s: None
    • Labels: None
    • Environment:

      Ubuntu 12.04, single-node Cassandra cluster

      Description

      Cassandra starts compacting the system-schema_columns cf in a seemingly infinite loop:

      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,804 CompactionTask.java (line 239) Compacted to [/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32107-Data.db,]. 3,827 to 3,827 (~100% of original) bytes for 3 keys at 0.202762MB/s. Time: 18ms.
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,804 CompactionTask.java (line 119) Compacting [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32107-Data.db')]
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,824 CompactionTask.java (line 239) Compacted to [/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32108-Data.db,]. 3,827 to 3,827 (~100% of original) bytes for 3 keys at 0.182486MB/s. Time: 20ms.
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,825 CompactionTask.java (line 119) Compacting [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32108-Data.db')]
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,864 CompactionTask.java (line 239) Compacted to [/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32109-Data.db,]. 3,827 to 3,827 (~100% of original) bytes for 3 keys at 0.096045MB/s. Time: 38ms.
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,864 CompactionTask.java (line 119) Compacting [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32109-Data.db')]
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,894 CompactionTask.java (line 239) Compacted to [/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32110-Data.db,]. 3,827 to 3,827 (~100% of original) bytes for 3 keys at 0.121657MB/s. Time: 30ms.
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,894 CompactionTask.java (line 119) Compacting [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32110-Data.db')]
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,914 CompactionTask.java (line 239) Compacted to [/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32111-Data.db,]. 3,827 to 3,827 (~100% of original) bytes for 3 keys at 0.202762MB/s. Time: 18ms.
      INFO [CompactionExecutor:7] 2012-10-09 17:55:46,914 CompactionTask.java (line 119) Compacting [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ia-32111-Data.db')]
      .........

      I don't know what's causing it, and I don't know a way to predictably trigger this behaviour. It just happens sometimes.

      1. 4781.txt
        1 kB
        Yuki Morishita
      2. 4781-v2.txt
        12 kB
        Yuki Morishita
      3. 4781-v3.txt
        12 kB
        Yuki Morishita

        Activity

        brandon.williams Brandon Williams added a comment -

        Here is something possibly related that is easy to reproduce. Given nodes X, Y and Z, start X. Now start Y, and see X flush schema_columns and schema_keyspaces in addition to peers/local, even though there's no schema and nothing changed. Now start Z and see both X and Y exhibit this behavior, and sometimes even flush the schema twice in a row.

        brandon.williams Brandon Williams added a comment -

        Interesting to note that if you hit this and then add nodes to the cluster, they will inherit the problem too. Issuing a schema change seems to stop the infinite compactions.

        yukim Yuki Morishita added a comment -

        I found the cause of this recursive compaction.

        We calculate how many columns in an sstable are expected to be dropped, and if that exceeds a threshold we do a single-sstable compaction. Cassandra does not drop tombstones when overlapping keys exist in other sstables, so when we have overlap, we calculate what percentage of the columns that do not belong to overlapped keys can be dropped.
        Here we use sstable.estimatedKeys in that calculation, but since the value comes from the index interval and index summary, there is a chance of calculating a wrong value when the number of keys in the sstable is small, as in this case with schema_columns. (schema_columns also has a lot of keys overlapping among sstables, since its key is the keyspace name.)

        So I propose to skip single-sstable compaction if the sstable contains a small number of keys.
        A patch is attached for this.
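
        The check described above can be sketched roughly as follows. This is a simplified illustration, not Cassandra's actual code: the 20% threshold, the key-count floor, and all names here are hypothetical.

```java
// Simplified sketch of the single-sstable tombstone compaction decision
// discussed in this ticket. All names and constants are illustrative.
public class TombstoneCheckSketch {
    static final double DROPPABLE_RATIO_THRESHOLD = 0.20; // hypothetical
    static final long MIN_KEYS_FOR_ESTIMATE = 300;        // hypothetical floor

    /** Decide whether a lone sstable is worth compacting for tombstones. */
    static boolean worthDroppingTombstones(long estimatedKeys,
                                           long overlappingKeys,
                                           double droppableTombstoneRatio) {
        // Proposed guard: with too few keys, estimatedKeys (derived from
        // the index summary and index interval) is unreliable, so skip.
        if (estimatedKeys < MIN_KEYS_FOR_ESTIMATE)
            return false;
        // Only columns outside the overlapped keys can actually be dropped.
        double remainingKeysRatio =
            1.0 - (double) overlappingKeys / estimatedKeys;
        return droppableTombstoneRatio * remainingKeysRatio
               > DROPPABLE_RATIO_THRESHOLD;
    }

    public static void main(String[] args) {
        // A 3-key sstable like schema_columns: the guard skips it outright.
        System.out.println(worthDroppingTombstones(3, 3, 0.9));
        // A large sstable with modest overlap still qualifies.
        System.out.println(worthDroppingTombstones(10_000, 1_000, 0.5));
    }
}
```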

        slebresne Sylvain Lebresne added a comment -

        I believe you are right that this is a problem. But I think there is another problem in that computation (one that does not only impact sstables with a small number of keys), namely in the estimation of remaining columns:

        long columns = sstable.getEstimatedColumnCount().percentile(remainingKeysRatio) * remainingKeys;
        

        I think the use of percentile here is not correct. For instance, say the remainingKeysRatio is very big (say 99%), and say that your rows are such that you have many small rows and a handful (5%) of very big ones. In that case, percentile will give you the number of columns the very big rows have (it will give you a number such that 99% of the rows have less than this number of columns), and you'll end up with an estimate of columns that is way off (that is, you could end up with a number of remaining columns that is orders of magnitude bigger than the total number of columns). I believe we should simply use:

        long columns = sstable.getEstimatedColumnCount().mean() * remainingKeys;
        
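        The skew described above can be shown numerically. This is an illustrative toy distribution, not data from the ticket: 95 rows of 10 columns and 5 rows of 100,000 columns.

```java
// Numeric illustration of the percentile-vs-mean point above, using a
// made-up row-size distribution: 95 small rows, 5 very big ones.
public class PercentileVsMean {
    static final long SMALL_ROWS = 95, SMALL_COLS = 10;
    static final long BIG_ROWS = 5, BIG_COLS = 100_000;

    static long totalColumns() {
        return SMALL_ROWS * SMALL_COLS + BIG_ROWS * BIG_COLS; // 500,950
    }

    // The 99th-percentile row size lands on a big row: 99% of the 100
    // rows have at most 100,000 columns, so percentile(0.99) ~ 100,000.
    static long estimateViaPercentile(long remainingKeys) {
        return BIG_COLS * remainingKeys;
    }

    static long estimateViaMean(long remainingKeys) {
        double meanColsPerRow = (double) totalColumns()
                              / (SMALL_ROWS + BIG_ROWS); // ~5,009.5
        return (long) (meanColsPerRow * remainingKeys);
    }

    public static void main(String[] args) {
        long remainingKeys = 99; // 99% of keys assumed non-overlapping
        // The percentile-based estimate (9,900,000) is nearly 20x the
        // sstable's total column count; the mean-based one stays sane.
        System.out.println("percentile: " + estimateViaPercentile(remainingKeys)
            + ", mean: " + estimateViaMean(remainingKeys)
            + ", actual total: " + totalColumns());
    }
}
```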

        For the estimated key number, I'm good with going with your solution, but an alternative one would be to use a more conservative estimated key number that would be:

        public long conservativeKeyEstimate()
        {
            return indexSummary.getKeys().size() < 2
                 ? 1
                 : (indexSummary.getKeys().size() - 1) * DatabaseDescriptor.getIndexInterval();
        }
        

        The advantage being that this would always under-estimate the number of keys, while estimatedKeys() always over-estimates it, which seems a better option here because we don't have to choose a rather arbitrary minimum number of samples after which we consider the over-estimation "acceptable" in proportion.
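
        To see the under- vs over-estimation concretely, here is a toy worked example. It assumes estimatedKeys() is roughly samples x interval, which is a simplification of the real code:

```java
// Why conservativeKeyEstimate() under-estimates while estimatedKeys()
// over-estimates: with an index interval of 128, the summary keeps every
// 128th key. An sstable with, say, 300 keys samples keys at positions
// 0, 128 and 256, i.e. 3 summary entries.
public class KeyEstimateSketch {
    static final int INDEX_INTERVAL = 128; // the default at the time

    // Simplified stand-in for estimatedKeys(): rounds up to full intervals.
    static long estimatedKeys(int summaryEntries) {
        return (long) summaryEntries * INDEX_INTERVAL;       // 3 * 128 = 384
    }

    // The conservative variant proposed above.
    static long conservativeKeyEstimate(int summaryEntries) {
        return summaryEntries < 2
             ? 1
             : (long) (summaryEntries - 1) * INDEX_INTERVAL; // 2 * 128 = 256
    }

    public static void main(String[] args) {
        long actualKeys = 300;
        System.out.println("actual=" + actualKeys
            + " estimated=" + estimatedKeys(3)                 // 384: over
            + " conservative=" + conservativeKeyEstimate(3));  // 256: under
    }
}
```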

        But all this being said, and while we should definitely fix the things above, they will only make the estimation better; it is still an estimation. So at least in theory, we could always end up in a case where the estimate thinks there are enough droppable tombstones, but in practice all the droppable tombstones are in overlapping ranges. That means I'd suggest skipping the worthDroppingTombstones check for sstables that have been compacted less than some time threshold ago (say maybe gcGrace/4; using the creation time of the file is probably good enough). After all, if an sstable has just been compacted and still has a high ratio of droppable tombstones, it's probably because those are in fact not droppable due to overlapping sstables.
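
        The time-based guard suggested above amounts to a simple predicate. A minimal sketch, with hypothetical names, taking timestamps as parameters rather than reading file metadata:

```java
// Sketch of the "recently compacted" guard suggested in this ticket:
// skip the tombstone-compaction check for sstables whose data file was
// created within some interval (e.g. gcGrace/4). Names are hypothetical.
public class RecentCompactionGuard {
    /** True if the sstable's file was created within intervalMillis of now. */
    static boolean recentlyCompacted(long fileCreatedMillis,
                                     long nowMillis,
                                     long intervalMillis) {
        return nowMillis - fileCreatedMillis < intervalMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long interval = 300_000L; // 5 minutes, for illustration
        // An sstable compacted 1s ago is skipped; one from ~7min ago is not.
        System.out.println(recentlyCompacted(now - 1_000, now, interval));
        System.out.println(recentlyCompacted(now - 400_000, now, interval));
    }
}
```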

        yukim Yuki Morishita added a comment -

        Sylvain, thanks for the feedback. v2 attached.

        I changed the use of percentile to mean. For the estimated keys, I kept my approach, since estimatedKeysForRanges might return a greater value than an under-estimated key count.

        But all this being said, and while we should definitely fix the things above, they will only make the estimation better; it is still an estimation. So at least in theory, we could always end up in a case where the estimate thinks there are enough droppable tombstones, but in practice all the droppable tombstones are in overlapping ranges.

        You are right, we need a safe stopper. I followed your advice of using the sstable creation time. But I made the time threshold configurable via a compaction strategy option with a default value of 5 minutes, since gcGraceSeconds can be set to 0, and this makes testing easy.

        Speaking of the schema_* columnfamilies, can we lower their minimum compaction threshold to 2? schema_* almost always have overlapped keys and tombstones, and are less likely to be updated.

        slebresne Sylvain Lebresne added a comment -

        Looks overall ok, but a few small remaining remarks:

        • I'm good with making the "tombstone compaction interval" configurable, but I would have gone with a longer default (say a day, or at least a few hours). Tombstone compactions can be fairly unproductive if we do them too often: compacting an sstable for tombstones, even if you do collect tombstones, is kinda bad if you're going to compact the sstable a short time later. In other words, I don't think this interval is only useful for avoiding an infinite loop. Besides, if it's configurable, the few people that have very heavy delete/expiring workloads can set it lower if that helps them.
        • It's probably not worth exposing an interval in milliseconds; seconds would be more than good enough. I also don't dislike putting the unit in the option name, e.g. "tombstone_compaction_interval_seconds", though maybe that's too long a name.
        • It would be nice to validate the user input for the new option (it should be > 0).
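        The validation requested above could look something like this. The option name and the one-day default come from later comments in this thread; the parsing helper itself is illustrative, not Cassandra's actual code:

```java
import java.util.Map;

// Illustrative sketch of parsing and validating the tombstone compaction
// interval option discussed above. The helper and its signature are
// hypothetical; only the option name and default reflect the thread.
public class TombstoneOptionValidation {
    static final String OPTION = "tombstone_compaction_interval";
    static final long DEFAULT_SECONDS = 86_400; // 1 day, per v3

    static long parseInterval(Map<String, String> options) {
        String raw = options.get(OPTION);
        if (raw == null)
            return DEFAULT_SECONDS;
        long seconds = Long.parseLong(raw);
        if (seconds <= 0) // user input must be > 0, as requested
            throw new IllegalArgumentException(
                OPTION + " must be greater than 0, got " + seconds);
        return seconds;
    }
}
```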
        yukim Yuki Morishita added a comment -

        v3 attached. The default interval is now 1 day, expressed in seconds, with validation.

        minaguib Mina Naguib added a comment -

        FWIW I've just hit this bug, on the very first startup of a node after it was upgraded from 1.1.2 to 1.1.6

        minaguib Mina Naguib added a comment -

        Restarting the node produced the same endless loop. Downgrading back to 1.1.2 fixed it.

        slebresne Sylvain Lebresne added a comment -

        Alright, v3 lgtm, +1. (I had in mind throwing an exception if tombstone_compaction_interval < 0, but I realize that it's slightly more complex than I thought and not only related to that option, so let's leave it like that here and I'll open a separate ticket for improving validation.)

        @Mina you cannot have hit this bug in 1.1.6 because it concerns code that is 1.2.0 only. So if you had a problem upgrading, that would be something else.

        minaguib Mina Naguib added a comment -

        @Sylvain Lebresne Here's the relevant portion of cassandra 1.1.6 starting up (after loading the SStables and replaying the commitlog): http://mina.naguib.ca/misc/cassandra_116_startup_loop.txt

        It loops and never ends until I kill cassandra.

        As I've said, restarting cassandra repeated the behavior. Downgrading from 1.1.6 to 1.1.2 resolved it.

        If you still feel it's unrelated to this ticket, let me know and I'll file a new one.

        slebresne Sylvain Lebresne added a comment -

        If you still feel it's unrelated to this ticket

        I feel nothing; I know it's unrelated. It cannot be related to this ticket, since this ticket is due to code that does not exist in 1.1. Besides, your log shows System.schema_keyspaces memtables being flushed over and over again (and they are not even empty), so that is what triggers compaction. Why you get continuous activity on the schema_keyspaces CF is a good question, but it is definitely not related to this ticket.

        yukim Yuki Morishita added a comment -

        Committed v3 to trunk.

        Mina Naguib, I'm closing this one because it was caused by functionality only available in trunk (as Sylvain stated), and I pushed the fix to trunk. Feel free to open another ticket for investigating the 1.1 issue.

        brandon.williams Brandon Williams added a comment -

        Mina Naguib, your issue is slightly different; there is constant flushing of the schema too.

        minaguib Mina Naguib added a comment -

        I've filed CASSANDRA-4880


          People

          • Assignee: yukim Yuki Morishita
          • Reporter: iamaleksey Aleksey Yeschenko
          • Reviewer: Sylvain Lebresne
          • Votes: 0
          • Watchers: 6
