Cassandra
  1. Cassandra
  2. CASSANDRA-5722

Cleanup should skip sstables that don't contain data outside a nodes ranges

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 2.0.1
    • Component/s: None
    • Labels:
      None

      Description

      Right now cleanup is optimized to simply delete sstables that only contain data that doesn't belong on the node, for all other sstables though, it will read them, check each row, and write out new sstables.

      Cleanup could be optimized to look at an sstable and determine that all data within the sstable does belong on a node, and therefore skip re-writing that sstable. This would make cleanup essentially a noop in the case where all data on a node belongs on that node.

      1. 5722-cassandra-2.0.patch
        14 kB
        Tyler Hobbs
      2. 0001-Skip-cleanup-when-unneeded.patch
        12 kB
        Tyler Hobbs

        Activity

        Hide
        Jonathan Ellis added a comment -

        committed

        Show
        Jonathan Ellis added a comment - committed
        Hide
        Tyler Hobbs added a comment -

        Attached file 5722-cassandra-2.0.patch (and branch) is rebased against 2.0. It includes the cleanup from your 5722 branch and an additional bugfix that was caught by a 2.0 test.

        Show
        Tyler Hobbs added a comment - Attached file 5722-cassandra-2.0.patch (and branch ) is rebased against 2.0. It includes the cleanup from your 5722 branch and an additional bugfix that was caught by a 2.0 test.
        Hide
        Jonathan Ellis added a comment -

        Can you rebase for the 2.0 branch?

        Show
        Jonathan Ellis added a comment - Can you rebase for the 2.0 branch?
        Hide
        Jonathan Ellis added a comment -

        All right. I'll commit once we branch off 2.0.0; thanks.

        Show
        Jonathan Ellis added a comment - All right. I'll commit once we branch off 2.0.0; thanks.
        Hide
        Tyler Hobbs added a comment -

        If you think about it, you can see why that might be so – we only have to scan extra rows on a bloom filter false positive. The common case by the time we start looping through rows is that the row we're looking for exists.

        Ahh, that makes perfect sense. Thanks!

        I can squash it down and commit if you're good with the changes.

        +1, sounds good.

        Show
        Tyler Hobbs added a comment - If you think about it, you can see why that might be so – we only have to scan extra rows on a bloom filter false positive. The common case by the time we start looping through rows is that the row we're looking for exists. Ahh, that makes perfect sense. Thanks! I can squash it down and commit if you're good with the changes. +1, sounds good.
        Hide
        Jonathan Ellis added a comment -

        Should I go ahead and roll your changes into a new patch?

        I can squash it down and commit if you're good with the changes.

        Show
        Jonathan Ellis added a comment - Should I go ahead and roll your changes into a new patch? I can squash it down and commit if you're good with the changes.
        Hide
        Jonathan Ellis added a comment -

        Sorry, I was really fixating on trying to relate your comment to that commit. I see what you mean, now.

        That's actually the way it used to work, pre-CASSANDRA-4710 – tldr, in Daniel's workload, the decoration was indeed higher than the extra work from scanning the extra rows.

        If you think about it, you can see why that might be so – we only have to scan extra rows on a bloom filter false positive. The common case by the time we start looping through rows is that the row we're looking for exists.

        Now that we allow disabling BF entirely that might not always be the case, but if you're disabling your BF I have to assume you know what you're doing and are prepared for the consequences.

        Show
        Jonathan Ellis added a comment - Sorry, I was really fixating on trying to relate your comment to that commit. I see what you mean, now. That's actually the way it used to work, pre- CASSANDRA-4710 – tldr, in Daniel's workload, the decoration was indeed higher than the extra work from scanning the extra rows. If you think about it, you can see why that might be so – we only have to scan extra rows on a bloom filter false positive. The common case by the time we start looping through rows is that the row we're looking for exists. Now that we allow disabling BF entirely that might not always be the case, but if you're disabling your BF I have to assume you know what you're doing and are prepared for the consequences.
        Hide
        Tyler Hobbs added a comment -

        However, now that I think about it, I believe your change to firstKeyBeyond() may mishandle one case. It needs to check the first key of the next range. Otherwise, if the token falls at the end of the index range (past the last indexed key), it will improperly return null, indicating no keys beyond that. I'll add a test to verify and comment if so.

        I was incorrect here, the simplification is good.

        Show
        Tyler Hobbs added a comment - However, now that I think about it, I believe your change to firstKeyBeyond() may mishandle one case. It needs to check the first key of the next range. Otherwise, if the token falls at the end of the index range (past the last indexed key), it will improperly return null, indicating no keys beyond that. I'll add a test to verify and comment if so. I was incorrect here, the simplification is good.
        Hide
        Tyler Hobbs added a comment -

        Sorry for needing to be brief. That commit just got me thinking about findPosition()'s behavior, since I borrowed parts of it for firstKeyBeyond(). Not a necessary change for this ticket, just a potential optimization we may have missed in getPosition() and some explanation for my own understanding of the performance characteristics.

        Show
        Tyler Hobbs added a comment - Sorry for needing to be brief. That commit just got me thinking about findPosition()'s behavior, since I borrowed parts of it for firstKeyBeyond(). Not a necessary change for this ticket, just a potential optimization we may have missed in getPosition() and some explanation for my own understanding of the performance characteristics.
        Hide
        Jonathan Ellis added a comment -

        my comment was not about firstKeyBeyond(), but about a potential optimization to findPosition()

        Did you link the wrong commit or am I still missing something? You linked the commit that removes the i++ code from firstKeyBeyond.

        Show
        Jonathan Ellis added a comment - my comment was not about firstKeyBeyond(), but about a potential optimization to findPosition() Did you link the wrong commit or am I still missing something? You linked the commit that removes the i++ code from firstKeyBeyond.
        Hide
        Tyler Hobbs added a comment -

        Sorry, to be clear, my comment was not about firstKeyBeyond(), but about a potential optimization to findPosition(). In findPosition(), we don't use compareTo() in the EQ case (presumably because of the cost of decorating to allow that?), even though using it could save ~64 loops on average.

        However, now that I think about it, I believe your change to firstKeyBeyond() may mishandle one case. It needs to check the first key of the next range. Otherwise, if the token falls at the end of the index range (past the last indexed key), it will improperly return null, indicating no keys beyond that. I'll add a test to verify and comment if so.

        Show
        Tyler Hobbs added a comment - Sorry, to be clear, my comment was not about firstKeyBeyond(), but about a potential optimization to findPosition(). In findPosition(), we don't use compareTo() in the EQ case (presumably because of the cost of decorating to allow that?), even though using it could save ~64 loops on average. However, now that I think about it, I believe your change to firstKeyBeyond() may mishandle one case. It needs to check the first key of the next range. Otherwise, if the token falls at the end of the index range (past the last indexed key), it will improperly return null, indicating no keys beyond that. I'll add a test to verify and comment if so.
        Hide
        Jonathan Ellis added a comment -

        is the cost of decorating index keys so high that it outweighs the savings from exiting the loop earlier when a greater key is found?

        That's exactly what we want to do, and that's what the indexDecoratedKey.compareTo(position) > 0 check does for us. The part I removed allows us to skip this check when we don't find key greater than the position before we've finished the block where such a key would exist, i.e., it saves us exactly one iteration of the loop [since the first key out of the next block is guaranteed to be greater].

        I thought that fell into the category of premature optimization and took it out so it was more clear what we're doing. Did I miss something?

        Show
        Jonathan Ellis added a comment - is the cost of decorating index keys so high that it outweighs the savings from exiting the loop earlier when a greater key is found? That's exactly what we want to do, and that's what the indexDecoratedKey.compareTo(position) > 0 check does for us. The part I removed allows us to skip this check when we don't find key greater than the position before we've finished the block where such a key would exist, i.e., it saves us exactly one iteration of the loop [since the first key out of the next block is guaranteed to be greater] . I thought that fell into the category of premature optimization and took it out so it was more clear what we're doing. Did I miss something?
        Hide
        Tyler Hobbs added a comment -

        Quick question around your commit here: https://github.com/jbellis/cassandra/commit/b9194842fc9eacc4fd55813cd200fa55f5713072. In getPosition() (which I lifted parts of), is the cost of decorating index keys so high that it outweighs the savings from exiting the loop earlier when a greater key is found? In other words, if we were to always decorate and do a normal compareTo() (even in the EQ case) with a return/break when the index key was higher, could that not potentially give better performance (skipping ~64 checks on average)?

        Should I go ahead and roll your changes into a new patch?

        I'm totally fine with doing CASSANDRA-2524. Shouldn't be too hard.

        Show
        Tyler Hobbs added a comment - Quick question around your commit here: https://github.com/jbellis/cassandra/commit/b9194842fc9eacc4fd55813cd200fa55f5713072 . In getPosition() (which I lifted parts of), is the cost of decorating index keys so high that it outweighs the savings from exiting the loop earlier when a greater key is found? In other words, if we were to always decorate and do a normal compareTo() (even in the EQ case) with a return/break when the index key was higher, could that not potentially give better performance (skipping ~64 checks on average)? Should I go ahead and roll your changes into a new patch? I'm totally fine with doing CASSANDRA-2524 . Shouldn't be too hard.
        Hide
        Jonathan Ellis added a comment - - edited

        Note to self, I do think we're 80% of the way to CASSANDRA-2524 here, so I should give that to you next.

        Edit: Maybe 50%, looks like SSTScanner only takes a single Range as of yet.

        Show
        Jonathan Ellis added a comment - - edited Note to self, I do think we're 80% of the way to CASSANDRA-2524 here, so I should give that to you next. Edit: Maybe 50%, looks like SSTScanner only takes a single Range as of yet.
        Hide
        Jonathan Ellis added a comment -

        Thanks! Pushed my cleanup to https://github.com/jbellis/cassandra/tree/5722.

        Show
        Jonathan Ellis added a comment - Thanks! Pushed my cleanup to https://github.com/jbellis/cassandra/tree/5722 .
        Hide
        Tyler Hobbs added a comment -

        Patch 0001 (and branch) skips cleanup when possible by checking portions of the primary index around range boundaries.

        Show
        Tyler Hobbs added a comment - Patch 0001 (and branch ) skips cleanup when possible by checking portions of the primary index around range boundaries.
        Hide
        Tyler Hobbs added a comment -

        I wasn't aware of CASSANDRA-2524, so I'll look into that and see if there's any overlap. Thinking about vnodes is actually why I suggested that technique instead of simply checking first/last keys for the sstable.

        It was easy to put together a quick first version (not polished); you can view a diff here: https://github.com/thobbs/cassandra/compare/apache:trunk...CASSANDRA-5722. Even with vnodes, I think it's a win.

        Show
        Tyler Hobbs added a comment - I wasn't aware of CASSANDRA-2524 , so I'll look into that and see if there's any overlap. Thinking about vnodes is actually why I suggested that technique instead of simply checking first/last keys for the sstable. It was easy to put together a quick first version (not polished); you can view a diff here: https://github.com/thobbs/cassandra/compare/apache:trunk...CASSANDRA-5722 . Even with vnodes, I think it's a win.
        Hide
        Nick Bailey added a comment -

        After thinking about this a bit more, this is going to be pretty useless with vnodes isn't it? Each sstable is going to contain keys from all over the ring so this probably won't be able to skip any.

        I guess there are enough people not on vnodes it could be worth it still though.

        Show
        Nick Bailey added a comment - After thinking about this a bit more, this is going to be pretty useless with vnodes isn't it? Each sstable is going to contain keys from all over the ring so this probably won't be able to skip any. I guess there are enough people not on vnodes it could be worth it still though.
        Hide
        Jonathan Ellis added a comment -

        Oh, I think I understand. You're thinking along the lines of CASSANDRA-2524 ?

        Show
        Jonathan Ellis added a comment - Oh, I think I understand. You're thinking along the lines of CASSANDRA-2524 ?
        Hide
        Jonathan Ellis added a comment -

        I'm lost, isn't this a simple comparison on first/last keys from the sstable summary?

        Show
        Jonathan Ellis added a comment - I'm lost, isn't this a simple comparison on first/last keys from the sstable summary?
        Hide
        Tyler Hobbs added a comment -

        To handle non-contiguous ranges, I thinking we can iterate over the owned token range boundaries for the local node, binary search for the approximate index position with the IndexSummary, and then check the index to see if the next/previous key is outside of the owned ranges. Does that sound reasonable?

        Show
        Tyler Hobbs added a comment - To handle non-contiguous ranges, I thinking we can iterate over the owned token range boundaries for the local node, binary search for the approximate index position with the IndexSummary, and then check the index to see if the next/previous key is outside of the owned ranges. Does that sound reasonable?

          People

          • Assignee:
            Tyler Hobbs
            Reporter:
            Nick Bailey
            Reviewer:
            Jonathan Ellis
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development