CASSANDRA-1555

Considerations for larger bloom filters

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.7.1
    • Component/s: Core
    • Labels:
      None

      Description

      To (optimally) support SSTables larger than 143 million keys, we need to support bloom filters larger than 2^31 bits, which java.util.BitSet can't handle directly.

      A few options:

      • Switch to a BitSet class which supports 2^31 * 64 bits (Lucene's OpenBitSet)
      • Partition the java.util.BitSet behind our current BloomFilter
        • Straightforward bit partitioning: bit N is in bitset N // 2^31
        • Separate equally sized complete bloom filters for member ranges, which can be used independently or OR'd together under memory pressure.

      All of these options require new approaches to serialization.
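      (Editor's note: as a rough sketch of the second option, not the eventual patch, bit N can be routed to java.util.BitSet number N // 2^31. Class and field names below are hypothetical.)

      import java.util.BitSet;

      /** Hypothetical partitioned bitset: bit N lives in partition N / 2^31, at offset N % 2^31. */
      public class PartitionedBitSet
      {
          private static final long PARTITION_SIZE = 1L << 31;
          private final BitSet[] partitions;

          public PartitionedBitSet(long numBits)
          {
              int count = (int) ((numBits + PARTITION_SIZE - 1) / PARTITION_SIZE);
              partitions = new BitSet[count];
              for (int i = 0; i < count; i++)
                  partitions[i] = new BitSet();
          }

          public void set(long bitIndex)
          {
              partitions[(int) (bitIndex / PARTITION_SIZE)].set((int) (bitIndex % PARTITION_SIZE));
          }

          public boolean get(long bitIndex)
          {
              return partitions[(int) (bitIndex / PARTITION_SIZE)].get((int) (bitIndex % PARTITION_SIZE));
          }
      }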

  Attachments

  1. 1555_v5.txt
        120 kB
        T Jake Luciani
      2. 1555_v6.txt
        119 kB
        T Jake Luciani
      3. 1555-v7.txt
        119 kB
        Jonathan Ellis
      4. cassandra-1555.tgz
        12 kB
        Ryan King
      5. CASSANDRA-1555v2.patch
        112 kB
        Ryan King
      6. CASSANDRA-1555v3.patch.gz
        20 kB
        Ryan King
      7. CASSANDRA-1555v4.patch.gz
        20 kB
        Ryan King

          Activity

          Stu Hood added a comment -

          Also interesting: http://code.google.com/p/javaewah/

          Jonathan Ellis added a comment -

          EWAH looks interesting. Would be interesting to see what that does for our BF.

          (More generally I suspect that degrading BF FP rate for extremely skinny rows can actually be the right thing to do.)

          Stu Hood added a comment -

          > EWAH looks interesting. Would be interesting to see what that does for our BF.
          I looked at it a bit more: it can't set() bits in random order (must ascend) and doesn't appear to support more than 2^31 bits. So we'd need another solution for filter creation, although we could compress into an EWAH at load time.

          > More generally I suspect that degrading BF FP rate for extremely skinny rows can actually be the right thing to do.
          rcoli posted a use case for ~240 million rows in ~90GB of data: I wouldn't call ~400 bytes extremely skinny, but 90GB is way too early to be degrading performance.

          Sylvain Lebresne added a comment -

          Even if we should probably have a way to support big BFs, it would still be nice to be able to
          configure the BF FP rate. 2+ GB of BFs can start to be a heavy weight on your memory.
          For some loads, it could be better to keep that memory for the row cache, for instance.

          Jonathan Ellis added a comment -

          > 90GB is way too early to be degrading performance

          Well, that depends. If we're "degrading" from 6.71e-05 to 0.00108 FP to save memory, then yes, that's definitely the right thing to do, or even a 16GB heap will get exhausted pretty quickly. But if we're having to degrade to > 1% FP then that would be something to address.
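          (Editor's note: false positive figures like these can be sanity-checked with the standard bloom filter approximation p ≈ (1 - e^(-k/c))^k, where c is bits per element and k the number of hash functions. A minimal, self-contained check, not part of any patch here:)

          /** Editor's sketch: approximate bloom filter false positive rate and memory cost. */
          public class BloomFpCheck
          {
              static double falsePositiveRate(double bitsPerElement, int hashCount)
              {
                  // p ~= (1 - e^(-k/c))^k, with c = m/n bits per element and k hash functions
                  return Math.pow(1 - Math.exp(-hashCount / bitsPerElement), hashCount);
              }

              public static void main(String[] args)
              {
                  System.out.println(falsePositiveRate(20, 14)); // ~6.7e-05
                  System.out.println(falsePositiveRate(10, 7));  // ~8.2e-03
                  // rough memory cost: one billion keys at 15 bits per key is ~1.8 GB of filter
                  System.out.println((1000000000L * 15 / 8 / (1 << 20)) + " MB for 1e9 keys at 15 bits per key");
              }
          }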

          Stu Hood added a comment -

          > If we're "degrading" from 6.71e-05 to 0.00108 FP to save memory, then yes, that's definitely the right thing to do
          They were actually at 0.0229 FP, but point taken.

          > it would still be nice to be able to configure the BF FP rate
          Perhaps. But, that doesn't excuse us from finding reasonable defaults.

          Ryan King added a comment - edited

          We have a patch to deal with having larger bitsets. I'm in the process of making it into a patchset but wanted to make some comments and get feedback first.

          It's not based on the Lucene bitset or EWAH, for a couple of reasons (which might not be totally legit):

          EWAH looks like it might be useful for reducing on-disk storage for BFs, but it doesn't appear to be useful for in-memory usage because the cost of checking a bit is linear, not constant. That seems like a serious performance degradation that's not even worth considering. And reducing the storage footprint for BFs seems like the least of our concerns at this point (not that we should never do it).

          Lucene's OpenBitSet is implemented as an array of longs, looks genuinely useful, and would allow us to have BFs of up to 64*2**32-1 bits (274,877,906,943). Our implementation http://github.com/kakugawa/cassandra/compare/fbb7c26acde572abf625...cassandra-0.6-counts-bf#diff-5 uses an array of BitSets and can therefore handle up to 2**30*2**31 bits (2.30584301 x 10**18). That's quite a bit more bits, but I'm not sure if it's worth it if the Lucene approach can cover us.

          UPDATE: note that the Lucene impl is 17 GB worth of data. I doubt we'd go over that anytime soon.
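          (Editor's note: the constant-time property being contrasted with EWAH above comes from flat word addressing. A stripped-down sketch of the array-of-longs approach, not Lucene's actual OpenBitSet code:)

          /** Editor's sketch of an array-of-longs bitset: word index = bit >>> 6, bit-in-word = bit & 63. */
          public class LongArrayBitSet
          {
              private final long[] words;

              public LongArrayBitSet(long numBits)
              {
                  words = new long[(int) ((numBits + 63) >>> 6)];
              }

              public void set(long bitIndex)
              {
                  words[(int) (bitIndex >>> 6)] |= 1L << (bitIndex & 63);
              }

              public boolean get(long bitIndex)
              {
                  // one array access plus a mask: O(1), regardless of how many bits are set
                  return (words[(int) (bitIndex >>> 6)] & (1L << (bitIndex & 63))) != 0;
              }
          }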

          Stu Hood added a comment -

          Well that's embarrassing.

          Ryan King added a comment -

          0001-openbitset.patch

          Port of Lucene's OpenBitSet (taken from patch on CASSANDRA-1472)

          0002-switch-bf-to-obs.patch

          Switch to use OBS in the BF. It should still be able to read BitSets from disk, but that needs more testing.

          T Jake Luciani added a comment -

          I'm not confident the murmur hash changes will produce unique enough numbers, given you are xor'ing 32 bits into 64 bits.

          I think we should update this to the 64 bit version: http://d3s.mff.cuni.cz/~holub/sw/javamurmurhash/MurmurHash.java

          Problem is then we break the BitSet backwards compatibility...

          Ryan King added a comment -

          Yeah, you're right, we need to upgrade to the 64-bit version.

          Our options with backward-compat seem to be:

          1) keep 2 separate BF impls, each with their own bitset and hash function

          2) regenerate the BF the first time we start up with this patch

          I'd lean towards #2, but am open to other suggestions.

          T Jake Luciani added a comment -

          I'm leaning towards 1): keep both, and upgrade to the newer impl by performing a major compaction.

          The problem with 2) is that for users with TBs of SSTables, regenerating on startup could take days...

          Ryan King added a comment -

          I think that's fair. I'll rework the patch.

          Ryan King added a comment -

          Here's a patch that takes a better approach:

          It uses the SSTable version to tell which type of bloom filter to use. In order to make this work I had to do some refactoring in the Iterators. There were a number of places where we were passing around CFMetaData objects where an SSTableReader would be better, because it would allow us to get at the Descriptor for that table. AFAICT all the call points had an SSTableReader available, so this refactoring was not very intrusive.

          The new thing is the BigBloomFilter, which uses OpenBitSet and LongMurmurHash. Some of it is copy/paste from BloomFilter.

          All unit and system tests pass, but this could use some more testing, for sure, especially around the upgrade path.

          Also, the LongMurmurHash seems to have more collisions. I'll see if I can figure out why.

          One other note: FilterTest became FilterTestHelper because it no longer has any test methods of its own.
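          (Editor's note: the version-based selection described above boils down to something like the following fragment; the field and method names are illustrative, not quoted from the patch, and it assumes the surrounding Cassandra classes.)

          // Editor's sketch; desc.usesBigBloomFilter is a hypothetical flag derived from the SSTable version.
          public static Filter deserializeFilter(DataInput in, Descriptor desc) throws IOException
          {
              // older SSTable versions carry the java.util.BitSet-backed BloomFilter (32-bit hash);
              // newer versions carry the OpenBitSet-backed BigBloomFilter (64-bit hash)
              return desc.usesBigBloomFilter
                   ? BigBloomFilter.serializer().deserialize(in)
                   : BloomFilter.serializer().deserialize(in);
          }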

          Stu Hood added a comment - edited
          • If BloomFilter is likely to be deprecated in favor of BigBloomFilter, naming them LegacyBloomFilter and BloomFilter might reduce the surface area of future changes
          • Probably a good opportunity to improve the serialization of BigBloomFilter: Java serialization is very wasteful for space (each row would contain the string "org.apache.cassandra.utils.obs.OpenBitSet"). Instead, just serializing an OpenBitSet as a long[] and # of valid bits would be much better
          • (Big)BloomFilter
            • maxBucketsPerElement can be pushed up into Filter
            • getFilter could probably be pushed up to Filter, or at least removed from BloomFilter
            • emptyBuckets is unused
          • Orphaned method BigBloomFilter.serializeBitSet
          • Indentation is off in SSTableReader and BigBloomFilter

          I'm working on a separate issue to refresh LegacySSTableTest to check the column-level bloom filters as well: see CASSANDRA-1822
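          (Editor's note: a minimal sketch of the long[]-plus-valid-bit-count serialization suggested in the second bullet; the actual serializer in the patch may differ.)

          import java.io.DataInput;
          import java.io.DataOutput;
          import java.io.IOException;

          /** Editor's sketch: write the backing words and valid-bit count directly instead of using Java serialization. */
          public class WordBitSetSerializer
          {
              public static void serialize(long[] words, long numBits, DataOutput out) throws IOException
              {
                  out.writeLong(numBits);
                  out.writeInt(words.length);
                  for (long word : words)
                      out.writeLong(word);
              }

              public static long[] deserialize(DataInput in) throws IOException
              {
                  long numBits = in.readLong(); // number of valid bits; a real filter would use this to size the bitset
                  long[] words = new long[in.readInt()];
                  for (int i = 0; i < words.length; i++)
                      words[i] = in.readLong();
                  return words;
              }
          }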

          Ryan King added a comment -

          New patch with several changes based on Stu's feedback:

          • renamed BloomFilter to LegacyBloomFilter and BigBloomFilter to BloomFilter
          • moved maxBucketsPerElement to BloomCalculations
          • removed emptybuckets
          • cleaned up formatting in SSTableReader and BigBloomFilter

          Finally, I changed the serialization to read and write the long[] directly, which saves a lot of space for small filters (the column filter for a 10-item row goes from 120 bytes to 16).

          Stu Hood added a comment -

          Almost there... I'm attaching an addendum that I needed to get the long-running unit tests building. Once I got them running, they were reporting an exception: run `ant clean long-test` to reproduce.

          As is, this patch passes the row-level compatibility test on CASSANDRA-1822, so as soon as we figure out the false positive problem, I can give it a thumbs up.

          Ryan King added a comment -

          Another round to fix the long tests.

          And on the FP rate, it seems that it's actually in the expected range based on the table here: http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html, though we should probably double-check that math.

          T Jake Luciani added a comment -

          Fixed the murmur hash problem (the issue was with the use of ByteBuffers).
          Refactored the code a bit; put hash32 and hash64 into the MurmurHash class.

          Overall I'm happy with this implementation, especially the sstable descriptor approach.

          +1

          Stu, I wasn't able to apply your latest patch for tests; could you rebase against v5?

          Ryan King added a comment -

          Stu's last patch is incorporated (in spirit, I took a slightly different approach) in my latest.

          Jonathan Ellis added a comment -

          Can you rebase v5, Jake?

          T Jake Luciani added a comment -

          Rebased and fixed the higher collision rate in hash64(). The code wasn't ensuring k remained unsigned. Now the test passes with < 100 collisions.
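          (Editor's note: in Java, the usual way a 64-bit hash value ends up "not unsigned" is sign extension when bytes are widened to long. A minimal illustration of the masking that avoids it, not the actual MurmurHash fix:)

          /** Editor's sketch: assemble a little-endian long from bytes without sign extension. */
          public final class UnsignedBytes
          {
              static long readLongLE(byte[] data, int offset)
              {
                  long k = 0;
                  for (int i = 7; i >= 0; i--)
                      // & 0xffL keeps a negative byte from sign-extending into the upper bits of k
                      k = (k << 8) | (data[offset + i] & 0xffL);
                  return k;
              }
          }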

          T Jake Luciani added a comment -

          re-rebased

          Jonathan Ellis added a comment -

          v7 attached w/ import and code style cleanup.

          The CFMetadata -> SSTableReader change in the Filter classes looks gratuitous – I'm guessing this is b/c the patch started against an older version of the code, when that is what the signature was. Let's undo that unless there is a deeper significance I am missing.

          T Jake Luciani added a comment -

          The CFMetadata -> SSTableReader change is required because the Descriptor is only available from the SSTableReader, and the Descriptor is where we know what type of bloom filter it's using. See SSTableNamesIterator.read().

          Jonathan Ellis added a comment -

          rebased + committed

          Hudson added a comment -

          Integrated in Cassandra-0.7 #116 (See https://hudson.apache.org/hudson/job/Cassandra-0.7/116/)
          add OpenBitSet to support larger bloom filters
          patch by Ryan King, Stu Hood, and tjake for CASSANDRA-1555


            People

            • Assignee: Ryan King
            • Reporter: Stu Hood
            • Reviewer: T Jake Luciani
            • Votes: 0
            • Watchers: 6
