Cassandra / CASSANDRA-5906

Avoid allocating over-large bloom filters

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.1 beta1
    • Component/s: Core
    • Labels: None

      Description

      We conservatively estimate the number of partitions post-compaction to be the total number of partitions pre-compaction. That is, we assume the worst-case scenario of no partition overlap at all.

      This can result in substantial wasted memory in the sstables produced by highly overlapping compactions.
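
      For illustration, a minimal sketch using the standard bloom filter sizing formula (this is not Cassandra's actual BloomCalculations code, and the sstable counts and false-positive rate below are made up):

      // m = -n * ln(p) / (ln 2)^2 bits for n keys at false-positive rate p
      static long bloomFilterBits(long keys, double fpChance)
      {
          return (long) Math.ceil(-keys * Math.log(fpChance) / (Math.log(2) * Math.log(2)));
      }

      // Compacting three fully overlapping sstables of 1M partitions each: the worst-case
      // estimate sizes the filter for 3M keys, about 3x what the merged sstable needs.
      long worstCase = bloomFilterBits(3_000_000, 0.01);   // ~28.8 Mbits (~3.6 MB)
      long actual    = bloomFilterBits(1_000_000, 0.01);   // ~9.6 Mbits  (~1.2 MB)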

      Attachments

      • 5906.txt (205 kB) by Yuki Morishita


          Activity

          Jonathan Ellis added a comment -

          I see two options:

          1. Build the BF [from the index] after the rest of compaction is complete and we know how many partitions need to be accounted for post-merge.
          2. Use merge statistics (CASSANDRA-5078) to make a better estimate
          Yuki Morishita added a comment -

          One experiment I tried is using HyperLogLog++ for cardinality estimation of the keys in the compaction set.

          https://github.com/yukim/cassandra/tree/5906-wip

          (HLL++ implementation is from https://github.com/clearspring/stream-lib)

          The basic idea is to save the HLL state of each SSTable to Statistics.db (SSTableMetadata) at flush/compaction time and merge those states when compacting.

          The following shows the result of size-tiered compaction after 2 runs of 'cassandra-stress -n 500000'.

          // cassandra-2.0 branch
          -rw-r--r--   1 yuki  staff       8460 Sep 13 16:28 Keyspace1-Standard1-jb-5-CRC.db
          -rw-r--r--   1 yuki  staff  138500000 Sep 13 16:28 Keyspace1-Standard1-jb-5-Data.db
          -rw-r--r--   1 yuki  staff         74 Sep 13 16:28 Keyspace1-Standard1-jb-5-Digest.sha1
          -rw-r--r--   1 yuki  staff    1141616 Sep 13 16:28 Keyspace1-Standard1-jb-5-Filter.db
          -rw-r--r--   1 yuki  staff   10000000 Sep 13 16:28 Keyspace1-Standard1-jb-5-Index.db
          -rw-r--r--   1 yuki  staff       4405 Sep 13 16:28 Keyspace1-Standard1-jb-5-Statistics.db
          -rw-r--r--   1 yuki  staff      70414 Sep 13 16:28 Keyspace1-Standard1-jb-5-Summary.db
          -rw-r--r--   1 yuki  staff         79 Sep 13 16:28 Keyspace1-Standard1-jb-5-TOC.txt
          
          // 5906-wip branch
          -rw-r--r--   1 yuki  staff       8460 Sep 13 16:34 Keyspace1-Standard1-jb-5-CRC.db
          -rw-r--r--   1 yuki  staff  138500000 Sep 13 16:34 Keyspace1-Standard1-jb-5-Data.db
          -rw-r--r--   1 yuki  staff         74 Sep 13 16:34 Keyspace1-Standard1-jb-5-Digest.sha1
          -rw-r--r--   1 yuki  staff     623240 Sep 13 16:34 Keyspace1-Standard1-jb-5-Filter.db
          -rw-r--r--   1 yuki  staff   10000000 Sep 13 16:34 Keyspace1-Standard1-jb-5-Index.db
          -rw-r--r--   1 yuki  staff      48111 Sep 13 16:34 Keyspace1-Standard1-jb-5-Statistics.db
          -rw-r--r--   1 yuki  staff      70414 Sep 13 16:34 Keyspace1-Standard1-jb-5-Summary.db
          -rw-r--r--   1 yuki  staff         79 Sep 13 16:34 Keyspace1-Standard1-jb-5-TOC.txt
          
          component        cassandra-2.0     5906-wip
          -Filter.db           1,141,616      623,240
          -Statistics.db           4,405       48,111

          The HLL serialized size is fixed, so the size of Statistics.db is almost the same for every SSTable.
          It is also configurable (a trade-off between the number of keys to hold and the error rate).

          With the current HLL parameters, the estimated number of keys was 498,578 (error: 0.28%).

          I will continue to experiment with various parameters and setups, but the above is what I have for now.

          Any thoughts?
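
          To make the merge step concrete, a minimal sketch against stream-lib's HyperLogLogPlus (this is not the actual patch; decoratedKey and the byte[] parameters are placeholders):

          import java.io.IOException;
          import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
          import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

          // At compaction time, rebuild each input sstable's estimator from the state
          // previously written alongside SSTableMetadata, merge, and read the estimate.
          static long estimateMergedPartitions(byte[] hllA, byte[] hllB)
                  throws IOException, CardinalityMergeException
          {
              HyperLogLogPlus a = HyperLogLogPlus.Builder.build(hllA);
              HyperLogLogPlus b = HyperLogLogPlus.Builder.build(hllB);
              a.addAll(b);              // merge registers; both sides assumed built with the same p/sp
              return a.cardinality();   // estimated distinct partitions post-merge, used to size the BF
          }

          // At write time each sstable would build its own state, e.g.:
          //   HyperLogLogPlus hll = new HyperLogLogPlus(16);   // p=16, as in this experiment
          //   hll.offer(decoratedKey);                         // once per partition key
          //   byte[] state = hll.getBytes();                   // persisted to Statistics.db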

          Brandon Williams added a comment -

          Ping Chris Burroughs

          Matt Abrams added a comment -

          Glad to see HLL++ might be helpful here. I think the approach mentioned above is a good one. One note of caution is that HLL++ (and HLL) are not thread safe. So you may need to synchronize access to the estimator if multiple threads will be updating it concurrently. There has been some discussion of creating a thread safe version of HLL++ but the general consensus has been that synchronization is better done at the client level. Let me know if you have strong opinions on this.
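
          For reference, a minimal hypothetical wrapper along those lines; the eventual approach avoids needing it, since a single thread builds the estimator and it is read-only afterwards (see the next comment):

          import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

          // Coarse-grained synchronization around a shared estimator (the p value is illustrative)
          class SynchronizedEstimator
          {
              private final HyperLogLogPlus hll = new HyperLogLogPlus(16);

              synchronized boolean offer(Object key) { return hll.offer(key); }
              synchronized long cardinality()        { return hll.cardinality(); }
          }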

          Jonathan Ellis added a comment -

          The approach looks good to me. I vote we ship it and have QA test a few different kinds of compactions to make sure the error rate stays low, but it looks like the defaults work pretty damn well. (And we should be okay on thread safety; only one thread builds it, after which it's immutable.)

          Two comments –

          1. What's going on with offer that we need to call getArray instead of handing it the ByteBuffer? I would prefer to fix that in HLL rather than copy out the array.
          2. Is 50K per sstable acceptable for LCS? That's 500MB if we have 10k sstables, which is within our goal of 5+ TB. I'd be more comfortable if we pulled this in only at compaction time. Note that we have precedent for doing this in the ancestors (CASSANDRA-5342), but if we keep adding "compaction time" metadata the same way then things are going to get messy; some cleanup is probably in order.
          Matt Abrams added a comment -

          Answers:

          1. Since ByteBuffer's hashCode is only a function of the number of bits remaining, we cannot use it directly in the offer function. If we implemented a helper function in HLL+, we would still need to get the byte array or some other derivative of the object that would be useful as a hash input. Let me know what you are looking for here and we can add it to stream-lib.

          2. The size of the HLL is a function of how precise you need it to be. If we use a p of 15 instead of 16, the size drops to 21K. Inserting the same 500K elements into an HLL+ with p=15 yields an error of .58% in my tests. The nice thing about HLLs is that their error is only a function of how big the HLL is, not how many elements you put into it. For example, with the same p=15 HLL+ instance I inserted 1 billion elements and still had an error of .49%. This error rate starts to break down when the number of elements inserted becomes massive (high billions or low trillions). It is possible to build an HLL that deals with that scale, but it hasn't been an issue for us yet.

          Question:

          The constructor for the HLL+ only sets a p value, not the sp (sparse p value). The advantage of using an sp is that for smaller tables you will get a near exact count. I'm not sure if this is required for this use case but I thought I'd mention it just in case.
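
          For reference, the two constructor forms being discussed, as a sketch with illustrative parameter values:

          import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

          // Normal mode only: 2^p registers, fixed size, relative error roughly 1.04 / sqrt(2^p)
          HyperLogLogPlus normalOnly = new HyperLogLogPlus(15);

          // Sparse mode enabled: near-exact counts for small sstables, switching to normal
          // mode once the sparse representation outgrows its threshold
          HyperLogLogPlus withSparse = new HyperLogLogPlus(15, 25);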

          Jonathan Ellis added a comment - edited

          Since ByteBuffer's hashCode is only a function of the number of bits remaining we cannot use it directly in the offer function.

          I don't follow – that should be exactly the desired behavior. The ByteBuffer offset/remaining are telling us, "this is the part of the backing array that we're interested in," which lets us "split up" regions of memory without having to actually copy to new arrays. So BBU.getArray is only for when some API accepts only arrays, and possibly having to perform a copy is the only alternative:

          /**
           * You should almost never use this.  Instead, use the write* methods to avoid copies.
           */
          public static byte[] getArray(ByteBuffer buffer)
          {
              int length = buffer.remaining();

              if (buffer.hasArray())
              {
                  int boff = buffer.arrayOffset() + buffer.position();
                  if (boff == 0 && length == buffer.array().length)
                      return buffer.array();
                  else
                      return Arrays.copyOfRange(buffer.array(), boff, boff + length);
              }
              // else, DirectByteBuffer.get() is the fastest route
              byte[] bytes = new byte[length];
              buffer.duplicate().get(bytes);

              return bytes;
          }
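
          A small hypothetical illustration of the two array-backed cases above (ByteBufferUtil is the "BBU" referenced):

          // Whole-array buffer: position 0 and remaining == array length,
          // so the backing array is returned directly with no copy.
          ByteBuffer whole = ByteBuffer.wrap(new byte[]{ 1, 2, 3, 4 });
          byte[] noCopy = ByteBufferUtil.getArray(whole);

          // Sliced view: offset/remaining select bytes {2, 3}, so getArray must copy that range.
          ByteBuffer slice = ByteBuffer.wrap(new byte[]{ 1, 2, 3, 4 }, 1, 2);
          byte[] copied = ByteBufferUtil.getArray(slice);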
          

          The size of the HLL is a function of how precise you need it to be. If we use a p of 15 instead of 16 the size drops to 21K. Inserting the same 500K elements into a HLL+ with p=15 yields of .58% in my tests.

          So, we can trade a factor of 2 in size for roughly a factor of 2 in precision? Unless we have a use for keeping these on heap that I can't think of, I'd say we should double the size and only read them in for compaction.

          Yuki Morishita added a comment -

          v2 pushed to: https://github.com/yukim/cassandra/commits/5906-v2

          HLL++ support for out-of-library hashing is merged (https://github.com/clearspring/stream-lib/pull/50) but not officially released yet, so v2 contains a custom-built stream-lib.
          The HLL++ parameters are unchanged from the first version.

          The other change I made was to let SSTableMetadata support reading the HLL++ state at compaction time only. To make this change a little easier, I also rewrote mutateLevel so that it does not deserialize the SSTableMetadata object (https://github.com/yukim/cassandra/commit/016c89b68ba74ca15fcac9fa6e6c37faeaee7bcd).
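
          In effect, that change lets the caller hash the key ByteBuffer itself and hand the 64-bit hash to the estimator, sidestepping the getArray copy discussed above. A hedged sketch (offerHashed is the entry point assumed from that pull request, and MurmurHash.hash2_64 is Cassandra's existing 64-bit Murmur2 helper):

          // key is the partition key ByteBuffer; cardinality is the per-sstable HyperLogLogPlus
          long hash = MurmurHash.hash2_64(key, key.position(), key.remaining(), 0);
          cardinality.offerHashed(hash);   // no intermediate byte[] copy of the key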

          Jonathan Ellis added a comment -

          That's starting to get to be a lot of deserialization code. Would it be simpler to split the only-at-compaction stuff into a separate Component?

          Jonathan Ellis added a comment -

          Also, this is more involved than I expected at first. Should we target for 2.1?

          Yuki Morishita added a comment -

          Should we target for 2.1?

          That's what I'm thinking right now.
          It would be much better if we could redesign and reorganize SSTableMetadata serialization.
          There are some other properties, like the partitioner name, that we only use once.

          Chris Burroughs added a comment -

          FYI, a version of stream-lib was released today.

          Yuki Morishita added a comment -

          Pushing this to be released in 2.1, after CASSANDRA-6356.

          Yuki Morishita added a comment -

          So far, I have tested HLL++ alone for serialized size and error rate with various parameters.
          https://docs.google.com/a/datastax.com/spreadsheet/ccc?key=0AsVe14L_ijtkdEhDbk1rTjYwb3ZjdXFlTnNCNnk2cGc#gid=13

          We can reduce the size from what I originally posted here (p=16, sp=0) down to less than 10k for p=13, sp=25. Using sparse mode, we can save space for smaller numbers of partitions.
          I think a relative error of 2% in the estimated partition count is tolerable for constructing the bloom filter (though I don't have a formula to prove it).
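
          As a rough sanity check on the 2% figure (standard bloom filter sizing math, not a formal proof), an estimate that is off by 2% moves the allocation, or the realized false-positive rate, by roughly the same few percent:

          // m = -n * ln(p) / (ln 2)^2 bits for n keys at target false-positive rate p
          long trueKeys      = 1_000_000;
          long estimatedKeys =   980_000;                                       // 2% under-estimate
          double bitsPerKey  = -Math.log(0.01) / (Math.log(2) * Math.log(2));   // ~9.6 bits/key at 1% fp
          long bitsAllocated = (long) Math.ceil(estimatedKeys * bitsPerKey);
          double bitsPerActualKey = (double) bitsAllocated / trueKeys;          // ~9.4 bits per real key
          // realized fp rate ~ exp(-bitsPerActualKey * (ln 2)^2) ~= 1.1%, versus the 1% target
          double realizedFp = Math.exp(-bitsPerActualKey * Math.log(2) * Math.log(2));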

          Jonathan Ellis added a comment -

          Why does HLL size spike around 10k elements?

          Yuki Morishita added a comment -

          I think it is the transition from sparse mode to normal mode, before going over the maximum number of registers normal mode uses.
          I'm reading the code with the paper in one hand...

          ping Chris Burroughs or Matt Abrams...

          Matt Abrams added a comment -

          When SP > 0 the algorithm uses a variant of a linear counter to get very accurate counts at small cardinalities. At some threshold the algorithm switches from the linear counter to HLL. Linear counters grow in size as a function of the number of inputs, whereas an HLL's size is a function of the desired error rate. We could (should?) tune the threshold so that the conversion happens earlier. Currently the threshold is equal to 2^p * 0.75.
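
          For concreteness, the element counts that threshold works out to for the p values tried above, which is the same range as the ~10k-element size spike asked about earlier (a quick calculation, not library output):

          // sparse-to-normal switch point: 0.75 * 2^p elements
          for (int p = 13; p <= 16; p++)
              System.out.printf("p=%d -> switch at ~%d elements%n", p, (int) (Math.pow(2, p) * 0.75));
          // p=13 -> 6144, p=14 -> 12288, p=15 -> 24576, p=16 -> 49152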

          Matt Abrams added a comment -

          We recently resolved a concurrency bug and improved performance of HLL++. We'll push a new release of stream-lib soon and I recommend using the updated version for this patch.

          Matt Abrams added a comment -

          Link to the bug I mentioned in the previous comment: https://github.com/addthis/stream-lib/commit/5a9e9cc7f71e1eb49635df29bba67d905903f1b8

          Yuki Morishita added a comment -

          (also: https://github.com/yukim/cassandra/tree/5906-v3)

          Attaching patch for review.

          • implemented on top of CASSANDRA-6356
          • updated stream-lib to v2.5.1 (latest)

          HLL++ parameters are p=13, sp=25, based on my observations above.

          Jonathan Ellis added a comment -

          Most changes to CompactionTask are cosmetic – split out to a separate commit?

          Why would metadata.cardinalityEstimator be null if we've already checked newStatsFile?

          Otherwise +1.

          Yuki Morishita added a comment -

          Committed with the fixes below:

          • Cosmetic changes are in a separate commit (388cbfae0c08cb1664bed52b044062ff5d6db617).
          • Removed the null check on cardinalityEstimator as it isn't necessary.

            People

            • Assignee: Yuki Morishita
            • Reporter: Jonathan Ellis
            • Reviewer: Jonathan Ellis
            • Votes: 0
            • Watchers: 5
