Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 3.x
    • Component/s: Core
    • Labels:

      Description

      We can consider an SSTable as a set of partition keys, and 'compaction' as de-duplication of those partition keys.
      We want to pick compaction candidates from SSTables that share as many keys as possible; if we can group similar SSTables by some measure of similarity, we can compact more efficiently.
      One such measure is the Jaccard distance, which we can estimate using a technique called MinHash.

      In Cassandra, we can calculate and store a MinHash signature when writing an SSTable. A new compaction strategy would use the signatures to find groups of similar SSTables as compaction candidates, falling back to STCS when no such candidates exist.

      This is just an idea floating around in my head, but I'm dumping it here before I forget. For an introduction to the technique, Chapter 3 of 'Mining of Massive Datasets' is a good starting point.
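
      For illustration, a minimal sketch of the idea (not Cassandra code; the class and its methods are made up): each SSTable keeps the minimum of k independent hash values over its partition keys, and the fraction of positions where two signatures agree estimates the Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| (the Jaccard distance is 1 minus that).

      {code:java}
      import java.util.Arrays;
      import java.util.Random;

      /** Minimal MinHash sketch over a set of partition-key hashes (illustration only). */
      public class MinHashSketch
      {
          private final long[] seeds;   // one seed per hash function
          private final long[] minima;  // running minimum per hash function

          public MinHashSketch(int k, long seed)
          {
              seeds = new long[k];
              minima = new long[k];
              Arrays.fill(minima, Long.MAX_VALUE);
              Random r = new Random(seed);
              for (int i = 0; i < k; i++)
                  seeds[i] = r.nextLong();
          }

          /** Fold one partition key (already hashed to a long, e.g. a Murmur3 token) into the signature. */
          public void add(long keyHash)
          {
              for (int i = 0; i < seeds.length; i++)
              {
                  long h = mix(keyHash ^ seeds[i]);
                  if (h < minima[i])
                      minima[i] = h;
              }
          }

          /** Estimated Jaccard similarity of the two key sets the signatures were built from (same k assumed). */
          public static double similarity(MinHashSketch a, MinHashSketch b)
          {
              int matches = 0;
              for (int i = 0; i < a.minima.length; i++)
                  if (a.minima[i] == b.minima[i])
                      matches++;
              return (double) matches / a.minima.length;
          }

          /** Cheap 64-bit finalizer (SplitMix64-style); any well-mixed hash would do here. */
          private static long mix(long z)
          {
              z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
              z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
              return z ^ (z >>> 31);
          }
      }
      {code}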

        Issue Links

          Activity

          Pavel Yaskevich added a comment -

          We have tried MinHash + LCS while considering compaction strategies for our use-cases. The problem with MinHash is that it's pretty expensive to generate, and it doesn't really help much as there is no way to split a Memtable into multiple SSTables "on flush" (yet); otherwise compaction would still do a lot of unnecessary data copying.

          sankalp kohli added a comment - - edited

          I am not working on it. I might get time next week to work on it. Feel free to take it if you like.

          Marcus Eriksson added a comment -

          sankalp kohli is this still on your radar?

          Jonathan Ellis added a comment - - edited

          Over in 6216 I said

          we should prefer smallest write amplification where

          wa = (overlapped size-minus-tombstones + candidate size-minus-tombstones) / candidate size-minus-tombstones

          With HLL we could actually compute write amplification directly instead of guessing based on size. But as I said above, I'm not sure this will be sufficiently better than size-estimation to be worth keeping HLL memory-resident.
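
          As a made-up numeric example of that score (the sizes below are arbitrary, tombstones already excluded):

          {code:java}
          // Arbitrary sizes just to show the arithmetic of the write-amplification score above.
          long candidateSize  = 100L << 20; // 100MB of candidate data
          long overlappedSize = 400L << 20; // 400MB of overlapping data pulled into the compaction
          double wa = (double) (overlappedSize + candidateSize) / candidateSize; // 5.0: five bytes rewritten per candidate byte
          {code}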

          sankalp kohli added a comment -

          I like the first approach since it is simpler: we keep an HLL of about 1K per sstable.

          Regarding your second approach, what do you mean by "For LCS it would only be useful for prioritization"?
          So you would calculate the sstable set based on the description in CASSANDRA-6216, but I am not sure how you would use HLL for prioritization.

          Jonathan Ellis added a comment -

          I see that https://github.com/aggregateknowledge/java-hll (thanks Theo Hultberg) claims that 1280 bytes is enough to count "billions" of values with "a few percent" error with HLL.

          1K per sstable gets us to 16TB of data in 1GB of HLL at the default 160MB/sstable LCS which seems quite reasonable to me, especially if we can move it off heap later. (Since few people are even approaching 5TB per node I think it's fine to leave it resident, on heap for now.)

          If that gets us more information than the minhash approach then that sounds like a win to me both for better compaction and simplicity of implementation. If we have to fudge our BF allocations by 5% or so, I'm okay with that trade.

          Another thought: cardinality estimation will primarily be useful for STCS, or whatever we call a new Similarity-based strategy. So I'm also okay with saying, "we'll calculate HLL for LCS and use it to calculate BF size accurately, but for STCS we'll keep it resident and use it to calculate candidates." For LCS it would only be useful for prioritization, since the set of potential candidates is determined by the level-sorting – and given the order of magnitude difference in each level, size-based estimates of similarity a la hyperdex should be "almost" as good. (http://hackingdistributed.com/2013/06/17/hyperleveldb/)

          Benedict added a comment -

          Yuki Morishita It sounds like you're only aiming to select the best compaction candidates, rather than potentially avoiding compacting files that can be said to be (mostly) non-intersecting. Wouldn't the latter be even more useful, especially for use cases where data is mostly appending to unique PartitionKeys?

          Also, it seems we could supplement whatever data we retain for the initial calculation with a sampling of one of the files (which, given that Murmur3 gives a good spread of data, could probably be optimised to scanning a percentage of pages rather than random records) to give tighter bounds on the actual overlap, in case we decide not to compact two files for this reason.

          Yuki Morishita added a comment -

          I don't think we need that much accuracy for this. We just want to find sets of SSTables whose resemblance is, say, more than 0.5.
          There is a storage-efficient improvement called b-bit minwise hashing [1][2], which stores only b bits (b = 1, 2, 3, ...) per hash, that we could use for this.

          1. http://research.microsoft.com/pubs/120078/wfc0398-lips.pdf
          2. http://research.microsoft.com/pubs/152334/cacm_hashing.pdf
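
          Very roughly, and only as an illustration (the helper below is hypothetical and uses the simplified estimator, which assumes the two key sets are large and of comparable size): keep just the low b bits of each of the k min-hash values and correct the observed match rate for accidental b-bit collisions.

          {code:java}
          // lowBitsA/lowBitsB hold the low b bits of k min-hash values for two sstables.
          static double estimateResemblance(int[] lowBitsA, int[] lowBitsB, int b)
          {
              int matches = 0;
              for (int i = 0; i < lowBitsA.length; i++)
                  if (lowBitsA[i] == lowBitsB[i])
                      matches++;
              double agree = (double) matches / lowBitsA.length;
              double collision = 1.0 / (1 << b);              // chance two unrelated b-bit values collide anyway
              return (agree - collision) / (1.0 - collision); // corrected estimate of the resemblance
          }
          {code}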

          Jonathan Ellis added a comment -

          So I'm running the numbers and it's not 100% obvious to me that minhash actually does better here:

          1. HLL of p=15 is about 10KB and has an error rate of 0.5%
          2. According to http://en.wikipedia.org/wiki/MinHash the expected error for minhash is 1/sqrt(k) for k hashes, so to get to a 0.5% error rate we'd need 40,000 hashes (160KB)

          So, for a relatively low error rate HLL wins. Does it just not "scale down" as well as minhash? How well would HLL do with a 1600 byte estimate, which minhash would give us 5% error on? What about 400 bytes?

          How accurate do we need the estimate to be, to be useful to compaction?

          Incidentally, I'm not sure how hash size affects the 1/sqrt(k) computation; surely an 8-byte hash would give more accurate results than a 2-byte hash?
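
          For reference, a back-of-envelope version of that sizing, taking the 1/sqrt(k) bound and 4-byte hashes from the numbers above as given:

          {code:java}
          // k hashes give roughly 1/sqrt(k) error, so k ≈ 1/ε² for a target error ε.
          static int hashesForError(double epsilon)
          {
              return (int) Math.ceil(1.0 / (epsilon * epsilon));
          }
          // hashesForError(0.005) -> 40,000 hashes ≈ 160KB at 4 bytes each
          // hashesForError(0.05)  ->    400 hashes ≈ 1.6KB at 4 bytes each
          {code}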

          sankalp kohli added a comment -

          Yup, makes sense. We could keep something like 400 4-byte min hashes in memory.

          Jonathan Ellis added a comment -

          Yes, but the reason Yuki proposes minhash here is that the HLL+ components are too large to keep on-heap all the time (think LCS with 1000s of sstables), so we only load them once we've already decided which sstables to compact.

          To search for which sstables overlap the most, we need to have it in memory for all sstables.

          sankalp kohli added a comment -

          I think the HyperLogLogPlus cardinality we added to the sstable stats might be useful here.
          Say sstable1 has a cardinality of 100 and sstable2 has a cardinality of 50, and merging them gives a cardinality of 130. That means there are 20 overlapping rows in these sstables. With this we can tell how similar the sstables are and calculate the Jaccard distance.
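
          A sketch of that arithmetic (the method name is made up; the inputs are HLL cardinality estimates, not exact counts):

          {code:java}
          // Inclusion-exclusion on three cardinality estimates: |A ∩ B| ≈ |A| + |B| - |A ∪ B|.
          static double estimatedJaccardSimilarity(long cardA, long cardB, long cardUnion)
          {
              long intersection = Math.max(0, cardA + cardB - cardUnion); // 100 + 50 - 130 = 20 overlapping keys
              return (double) intersection / cardUnion;                   // ≈ 20 / 130; Jaccard distance = 1 - this
          }
          {code}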
          Jonathan Ellis Does it sound right?

          sankalp kohli added a comment -

          This JIRA has two parts; we could also split the work into two dependent JIRAs.
          1) Store x min hashes over all the rows in an sstable. I am not sure whether Murmur3 is a good candidate.
          2) Find an sstable in level L such that its overlapping sstables in level L+1 have minimal Jaccard distance, i.e. the most similar rows.

          How does this sound?


            People

            • Assignee:
              sankalp kohli
              Reporter:
              Yuki Morishita
            • Votes:
              0
              Watchers:
              14

              Dates

              • Created:
                Updated:

                Development