Cassandra / CASSANDRA-4022

Compaction of hints can get stuck in a loop

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: Core
    • Labels:
      None

      Description

      Not exactly sure how I caused this as I was working on something else in trunk, but:

       INFO 17:41:35,682 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-339-Data.db')]
 INFO 17:41:36,430 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-340-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 5.912220MB/s.  Time: 748ms.
       INFO 17:41:36,431 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-340-Data.db')]
 INFO 17:41:37,238 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-341-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 5.479976MB/s.  Time: 807ms.
       INFO 17:41:37,239 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-341-Data.db')]
 INFO 17:41:38,163 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-342-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 4.786083MB/s.  Time: 924ms.
       INFO 17:41:38,164 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-342-Data.db')]
       INFO 17:41:39,014 GC for ParNew: 274 ms for 1 collections, 541261288 used; max is 1024458752
 INFO 17:41:39,151 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-343-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 4.485132MB/s.  Time: 986ms.
       INFO 17:41:39,151 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-343-Data.db')]
       INFO 17:41:40,016 GC for ParNew: 308 ms for 1 collections, 585582200 used; max is 1024458752
 INFO 17:41:40,200 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-344-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 4.223821MB/s.  Time: 1,047ms.
       INFO 17:41:40,201 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-344-Data.db')]
       INFO 17:41:41,017 GC for ParNew: 252 ms for 1 collections, 617877904 used; max is 1024458752
 INFO 17:41:41,178 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-345-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 4.526449MB/s.  Time: 977ms.
       INFO 17:41:41,179 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-345-Data.db')]
       INFO 17:41:41,885 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-346-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes 
      for 1 keys at 6.263938MB/s.  Time: 706ms.
       INFO 17:41:41,887 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-346-Data.db')]
       INFO 17:41:42,617 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-347-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 6.066311MB/s.  Time: 729ms.
       INFO 17:41:42,618 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-347-Data.db')]
       INFO 17:41:43,376 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-348-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 5.834222MB/s.  Time: 758ms.
       INFO 17:41:43,377 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-348-Data.db')]
       INFO 17:41:44,307 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-349-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 4.760323MB/s.  Time: 929ms.
       INFO 17:41:44,308 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-349-Data.db')]
       INFO 17:41:45,021 GC for ParNew: 245 ms for 1 collections, 731287832 used; max is 1024458752
       INFO 17:41:45,316 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-350-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 4.395965MB/s.  Time: 1,006ms.
       INFO 17:41:45,316 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-350-Data.db')]
       INFO 17:41:46,022 GC for ParNew: 353 ms for 1 collections, 757476872 used; max is 1024458752
       INFO 17:41:46,451 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-351-Data.db,].  4,637,160 to 4,637
      

      I suspect we broke something subtle in CASSANDRA-3955.

      1. 4022-v2.txt
        8 kB
        Yuki Morishita
      2. 4022.txt
        4 kB
        Yuki Morishita

        Activity

        Jonathan Ellis added a comment -

        rebased + committed

        Jonathan Ellis added a comment -

        v2 lgtm but I'm going to rebase on top of CASSANDRA-4080 before committing.

        Jonathan Ellis added a comment -

        Dropping a tombstone is only done when the key the tombstone belongs to does not appear in other sstables that are not being compacted (CompactionController#shouldPurge).

        Right. I'm saying we can make that more sophisticated, e.g. by changing the shouldPurge signature to (key, maxTombstoneTimestamp), which we could then compare to the min timestamp from the overlapping-but-not-compaction-participant sstables.

        But, thinking about that more, it's unlikely to help much since both STCS and LCS mix data of different ages together routinely. So I'll let that drop now.

        Yuki Morishita added a comment -

        Dropping a tombstone is only done when the key the tombstone belongs to does not appear in other sstables that are not being compacted (CompactionController#shouldPurge).
        We may be able to tweak the above to look at the timestamps of columns, but it would cost too much.

        Instead, I came up with a way of "guessing" how many tombstones there are outside of the keys that overlap with other sstables. When an sstable has a droppable tombstone ratio > threshold but overlaps keys with others, then calculate:

        (# of keys outside of overlap) remainingKeys = sstable.estimatedKeys - sstable.estimatedKeysForRanges(overlapped range)
        (# of columns outside of overlap) remainingColumns = sstable.estimatedColumnCount.percentile(remainingKeys / total keys) * remainingKeys
        

        and if (remainingColumns / total columns) * (droppable tombstone ratio) is greater than the threshold, compact that sstable by itself.

        I think that in this way the chance of single-sstable compaction increases, while recursive compaction of the same sstable is avoided.
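
        The estimate described above can be sketched as follows. All names here are placeholders for the sstable statistics mentioned in the comment (estimated key counts, the column-count histogram), not Cassandra's actual API:

```java
// Hypothetical sketch of the overlap-estimation heuristic. The inputs stand in
// for sstable statistics; none of these names are Cassandra's real API.
class OverlapEstimate {
    static boolean worthCompacting(long estimatedKeys,
                                   long overlapKeys,          // keys inside the overlapped range
                                   double avgColumnsPerKey,   // from the column-count histogram
                                   long totalColumns,
                                   double droppableTombstoneRatio,
                                   double threshold) {
        // (# of keys outside of overlap)
        long remainingKeys = estimatedKeys - overlapKeys;
        // (# of columns outside of overlap)
        double remainingColumns = avgColumnsPerKey * remainingKeys;
        // scale the droppable ratio down to the portion we can actually purge
        return (remainingColumns / totalColumns) * droppableTombstoneRatio > threshold;
    }
}
```

        For example, with 900 of 1,000 keys outside the overlap, 10 columns per key, and a droppable ratio of 0.5, the scaled ratio is 0.45 and the sstable is compacted alone; with only 50 keys outside the overlap it drops to 0.025 and the sstable is skipped, which is what breaks the loop.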

        Jonathan Ellis added a comment -

        Suppose for example we have two sstables:

        SSTable A has a tombstone for row K, column foo, at time=100.

        SSTable B has data for row K, column bar, at time=200.

        We would like to allow A to be compacted by itself to get rid of tombstones, since even though B has overlapping data it is new enough that removing the TS is safe.
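
        The rule in this example reduces to a timestamp comparison. A minimal sketch, assuming hypothetical names (safeToPurge here is not the actual shouldPurge signature):

```java
// Hypothetical sketch of the safety rule in the example above: a tombstone
// written at time t can be purged even when other sstables share its key,
// as long as every overlapping sstable holds only data newer than t.
class TombstoneSafety {
    static boolean safeToPurge(long tombstoneTimestamp, long[] overlappingMinTimestamps) {
        for (long minTimestamp : overlappingMinTimestamps) {
            if (minTimestamp <= tombstoneTimestamp) {
                return false; // that sstable may hold older data the tombstone still shadows
            }
        }
        return true;
    }
}
```

        In the example, A's tombstone is at time=100 and B's only data is at time=200, so purging A's tombstone is safe and A can be compacted by itself.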

        Jonathan Ellis added a comment -

        Right, if there's no overlap we're free to compact – but I'm worried that with SizeTiered compaction we'll have overlap in a lot of cases where we could still compact if we looked closer.

        Yuki Morishita added a comment -

        I understand the situation, but isn't it covered by just checking key overlap?
        If there is no overlap, aren't the tombstones in the target sstable guaranteed to be the only, and the newest, ones?

        Jonathan Ellis added a comment -

        We don't need to suppress tombstones – we just need to make sure that any data overlapping the tombstones we're compacting is new enough that we don't need the tombstones to suppress it. In other words, that throwing away our tombstones won't make deleted data start showing up again.

        Yuki Morishita added a comment -

        I tried using timestamps to determine whether an sstable should be compacted, but that does not guarantee the tombstones get suppressed. Tombstones only get dropped when their keys don't appear in sstables other than the ones being compacted. Currently I think the only way to stop the compaction loop is to make sure the sstable in question has no overlap, so that its tombstones actually get dropped.

        Jonathan Ellis added a comment -

        I think we should check both for overlaps and that the timestamp is old enough to have data we care about suppressing. The former alone will be common in size-tiered compaction.

        Yuki Morishita added a comment -

        One possible solution is to add a check for overlap of the key range stored in the sstable.
        Patch attached, with a minor fix for a test.

        Jonathan Ellis added a comment -

        But there is a situation in which compaction does not drop tombstones: when a key in the compacting sstable appears in other sstables.

        What if we checked the most-recent timestamp of the other sstables, and avoided compaction only if there is potentially data old enough that we'd need the tombstones to suppress it?

        Yuki Morishita added a comment -

        The basic idea behind CASSANDRA-3442 is to perform single-sstable compaction when an sstable's droppable tombstone ratio is above a threshold.
        This works because single-sstable compaction drops tombstones and lowers the droppable tombstone ratio, which prevents recursive compaction of that sstable.

        But there is a situation in which compaction does not drop tombstones: when a key in the compacting sstable appears in other sstables.

        Let me find a way to handle that case...
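
        The trigger, and why it can loop, can be sketched as follows; the threshold value and all names are illustrative, not Cassandra's actual code:

```java
// Hypothetical sketch of the CASSANDRA-3442 trigger and its failure mode.
// If the sstable's keys overlap other sstables, compaction rewrites the data
// without dropping tombstones, the ratio never falls below the threshold,
// and the same sstable is immediately selected again.
class SingleSSTableCompaction {
    static final double TOMBSTONE_THRESHOLD = 0.2; // illustrative value

    static boolean shouldCompactAlone(double droppableTombstoneRatio) {
        return droppableTombstoneRatio > TOMBSTONE_THRESHOLD;
    }

    /** Ratio after compaction: unchanged when overlap prevents any drop. */
    static double ratioAfterCompaction(double ratio, boolean keysOverlapOtherSSTables) {
        return keysOverlapOtherSSTables ? ratio : 0.0;
    }
}
```

        When the keys overlap, the post-compaction ratio still exceeds the threshold, so the sstable is picked again on the next check – the loop seen in the description's log.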

        Brandon Williams added a comment -

        Confirmed that this only happens with CASSANDRA-3442 applied.

        Brandon Williams added a comment -

        What is happening very reproducibly now is that I started the node, and 5 minutes later the forced compaction check in ACS kicks off, and then I have looping compaction on the hints but it's only compacting the last sstable over and over.

        Brandon Williams added a comment -

        I should note that the machine does not hand anything off, so everything in these sstables must be tombstones.

        Brandon Williams added a comment -

        Yuki mentions that it may be caused by CASSANDRA-3442 too.

        Brandon Williams added a comment -

        It seems part of the problem is that it doesn't know about one of the sstables it needs to compact:

        cassandra-1:/srv/cassandra# ls -l /var/lib/cassandra/data/system/HintsColumnFamily/
        total 66804
        -rw-r--r-- 1 root root 63642821 Mar  8 17:36 system-HintsColumnFamily-hd-13-Data.db
        -rw-r--r-- 1 root root       80 Mar  8 17:36 system-HintsColumnFamily-hd-13-Digest.sha1
        -rw-r--r-- 1 root root      976 Mar  8 17:36 system-HintsColumnFamily-hd-13-Filter.db
        -rw-r--r-- 1 root root       26 Mar  8 17:36 system-HintsColumnFamily-hd-13-Index.db
        -rw-r--r-- 1 root root     4344 Mar  8 17:36 system-HintsColumnFamily-hd-13-Statistics.db
        -rw-r--r-- 1 root root  4637160 Mar  8 17:59 system-HintsColumnFamily-hd-639-Data.db
        -rw-r--r-- 1 root root       81 Mar  8 17:59 system-HintsColumnFamily-hd-639-Digest.sha1
        -rw-r--r-- 1 root root      496 Mar  8 17:59 system-HintsColumnFamily-hd-639-Filter.db
        -rw-r--r-- 1 root root       26 Mar  8 17:59 system-HintsColumnFamily-hd-639-Index.db
        -rw-r--r-- 1 root root     5944 Mar  8 17:59 system-HintsColumnFamily-hd-639-Statistics.db
        -rw-r--r-- 1 root root        0 Mar  8 17:59 system-HintsColumnFamily-tmp-hd-640-Data.db
        -rw-r--r-- 1 root root        0 Mar  8 17:59 system-HintsColumnFamily-tmp-hd-640-Index.db
        

          People

          • Assignee: Yuki Morishita
          • Reporter: Brandon Williams
          • Reviewer: Jonathan Ellis
          • Votes: 0
          • Watchers: 0
