HBase
HBASE-19

CountingBloomFilter can overflow its storage

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:
      None

      Description

      The org.onelab.filter.CountingBloomFilter implementation does not check the value of a bucket before incrementing/decrementing it. The buckets in a Counting Bloom filter must not be allowed to overflow, and if they reach their maximum value, they must not be allowed to decrement. This is the only way to preserve the assumptions of the filter (without larger buckets). See: http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters

      Currently, if enough values hash to a bucket, the CountingBloomFilter may begin reporting false negatives when it wraps back around to 0.
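
      The rule the description calls for is a saturating counter. A minimal sketch (class and method names are illustrative, not the actual org.onelab.filter.CountingBloomFilter API):

```java
// A minimal sketch of the saturating-bucket rule the description calls for.
// Class and method names are illustrative, not the actual
// org.onelab.filter.CountingBloomFilter API.
public class SaturatingBucket {

    static final byte MAX = Byte.MAX_VALUE;

    // Increment, but never past MAX: no wrap back to a negative value.
    static byte increment(byte bucket) {
        return bucket < MAX ? (byte) (bucket + 1) : MAX;
    }

    // Once a bucket saturates, the true count is unknown, so it must
    // never be decremented; an empty bucket stays empty as well.
    static byte decrement(byte bucket) {
        if (bucket == MAX || bucket == 0) {
            return bucket;
        }
        return (byte) (bucket - 1);
    }

    public static void main(String[] args) {
        byte b = MAX;
        b = increment(b);      // stays at 127 instead of wrapping to -128
        b = decrement(b);      // stays at 127: the count is no longer exact
        System.out.println(b); // prints 127
    }
}
```

      Pinning a saturated bucket is what prevents the false negatives described above: a bucket can never wrap to 0 while elements that hash to it are still in the filter.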

      1. 19.patch
        7 kB
        Bryan Duxbury
      2. 19-v2.patch
        7 kB
        Bryan Duxbury
      3. counting-overflow.patch
        0.9 kB
        Stu Hood
      4. counting-overflow-fourbit.patch
        7 kB
        Stu Hood
      5. counting-overflow-fourbit.patch
        7 kB
        Stu Hood

        Activity

        stack added a comment -

        Committed. Thanks Stu (And Bryan for fixing patch to apply in new location).

        Bryan Duxbury added a comment -

        Replaced a poorly sed'd patch.

        Bryan Duxbury added a comment -

        Updated to apply to new svn.

        stack added a comment -

        Let's apply the patch. It fixes the broken counting Bloom filter implementation (though this filter type doesn't look to be directly useful in HBase going forward).

        Bryan Duxbury added a comment -

        We've been discussing making Bloom filters mandatory on MapFiles. Also, the plan would be to use standard Bloom filters, since we'd only create them when we're flushing or compacting, and would never have to delete from them. In light of that, should we bother with this issue, or just resolve it as Won't Fix?

        stack added a comment -

        Patch looks good, Stu. Let me apply it after we make the 0.16 branch. Currently, only blockers are to be applied (there is to be a vote on 0.16 on the 24th).

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12373593/counting-overflow-fourbit.patch
        against trunk revision r613499.

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1664/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1664/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1664/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1664/console

        This message is automatically generated.

        Stu Hood added a comment -

        Use (long[]).clone() rather than Arrays.copyOf.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12373592/counting-overflow-fourbit.patch
        against trunk revision r613359.

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs -1. The patch appears to cause Findbugs to fail.

        core tests -1. The patch failed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1646/testReport/
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1646/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1646/console

        This message is automatically generated.

        Stu Hood added a comment -

        This patch implements the CountingBloomFilter using 4-bit buckets in a long[]. Since the majority of the time is spent in the hash function, it runs at approximately the same speed as BloomFilter.
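
        An illustrative sketch of the 4-bit-buckets-in-a-long[] layout Stu describes follows; the field names, bit order, and saturation value of 15 are my assumptions, not the patch itself.

```java
// Illustrative sketch of 4-bit buckets packed sixteen to a long, roughly
// the layout Stu describes. Names and bit order are assumptions, not the
// actual patch.
public class FourBitBuckets {

    private final long[] words;

    FourBitBuckets(int nBuckets) {
        // 16 four-bit buckets fit in each 64-bit word.
        words = new long[(nBuckets + 15) / 16];
    }

    int get(int i) {
        int shift = (i % 16) * 4;
        return (int) ((words[i / 16] >>> shift) & 0xFL);
    }

    void set(int i, int value) {
        int shift = (i % 16) * 4;
        long cleared = words[i / 16] & ~(0xFL << shift);
        words[i / 16] = cleared | ((long) (value & 0xF) << shift);
    }

    // Saturating increment: a bucket that reaches 15 stays at 15.
    void increment(int i) {
        int v = get(i);
        if (v < 15) {
            set(i, v + 1);
        }
    }

    public static void main(String[] args) {
        FourBitBuckets f = new FourBitBuckets(32);
        for (int n = 0; n < 20; n++) {
            f.increment(17);
        }
        System.out.println(f.get(17)); // prints 15: saturated, no wrap
        System.out.println(f.get(16)); // prints 0: neighbors untouched
    }
}
```

        Because the buckets are unsigned 4-bit fields, the full 0-15 range is usable, unlike signed bytes where only 0-127 of the 8 bits can safely hold a count.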

        stack added a comment -

        That would be great, Stu.

        Stu Hood added a comment -

        Also, I just noticed that this patch does not fully resolve the issue. The Filter.and(Filter) and Filter.or(Filter) methods perform AND and OR on the bytes, respectively, which can result in negative values in the resulting CountingBloomFilter.

        I think I'll write a patch to replace the byte storage with 4-bit buckets so that we can use the full range of the buckets.
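
        Stu's point can be illustrated with a short sketch (the class name and the example values are mine, not from the patch): bitwise OR is not addition of counts, and a bucket that has already wrapped past 127 carries its sign bit into any OR'd result.

```java
// Sketch of why bitwise AND/OR are wrong combiners for count buckets.
// Values are illustrative, not taken from the patch.
public class OrIsNotUnion {
    public static void main(String[] args) {
        byte a = 3, b = 1;
        // A true union of counting filters would add the counts (4);
        // OR gives 3, so later deletions could falsely empty the bucket.
        System.out.println((byte) (a | b));       // prints 3

        byte wrapped = (byte) 0x80;               // a bucket that overflowed past 127
        System.out.println((byte) (wrapped | 5)); // prints -123: a negative "count"
    }
}
```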

        Stu Hood added a comment -

        Yeah, it is byte-wide: the storage in this implementation is an array of bytes, each byte representing a bucket.

        On a related topic, I think it would be a nice improvement to convert from byte buckets to 4-bit buckets as suggested in the article. Currently, we are only effectively using 7 bits of each bucket, because we are treating the value as unsigned (0-127). It would also obviously be a space improvement.

        stack added a comment -

        Patch looks good, Stu. Does it say anywhere that this counter is byte-wide? (I don't see it on a quick check of the code.) Is this value that is being incremented the "3 or 4 bit counter" that is referred to in the Wikipedia article? Thanks.

        Stu Hood added a comment -

        Here is a patch that compares the value of each bucket to Byte.MAX_VALUE to ensure that buckets stay in range.


          People

          • Assignee:
            Unassigned
            Reporter:
            Stu Hood
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
