Hadoop Common
  1. Hadoop Common
  2. HADOOP-3063

BloomMapFile - fail-fast version of MapFile for sparsely populated key space

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.0
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys.

      Description

      The need for this improvement arose when working with large ancillary MapFile-s (essentially used as external dictionaries). For each invokation of map() / reduce() it was necessary to perform several look-ups in these MapFile-s, and in case of sparsely populated key-space the cost of finding that a key is absent was too high.

      This patch implements a subclass of MapFile that creates a Bloom filter from all keys, so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.

      Writer.append() operations update a DynamicBloomFilter, which is then serialized when the Writer is closed. This filter is loaded in memory when a Reader is created. Reader.get() operation first checks the filter for the key membership, and if the key is absent it immediately returns null without doing any further IO.

      1. bloommap-v4.patch
        101 kB
        Andrzej Bialecki
      2. bloommap-v3.patch
        96 kB
        Andrzej Bialecki
      3. bloommap-v2.patch
        77 kB
        Andrzej Bialecki
      4. bloommap.patch
        87 kB
        Andrzej Bialecki

        Issue Links

          Activity

          Hide
          Robert Chansler added a comment -

          Edit release note for publication.

          Show
          Robert Chansler added a comment - Edit release note for publication.
          Hide
          stack added a comment -

          Committed on behalf of Andrzej. He's swamped today and he (and I) wanted to get this in before 0.20.0 feature-freeze.

          Show
          stack added a comment - Committed on behalf of Andrzej. He's swamped today and he (and I) wanted to get this in before 0.20.0 feature-freeze.
          Hide
          stack added a comment -

          The failing tests seem unrelated to this patch:

          org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestFileTailingAdaptors.testLogRotate
          org.apache.hadoop.hdfs.TestFileAppend2.testComplexAppend
          

          IMO, these failures shouldn't hold up committing this patch.

          Show
          stack added a comment - The failing tests seem unrelated to this patch: org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestFileTailingAdaptors.testLogRotate org.apache.hadoop.hdfs.TestFileAppend2.testComplexAppend IMO, these failures shouldn't hold up committing this patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12395994/bloommap-v4.patch
          against trunk revision 726129.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12395994/bloommap-v4.patch against trunk revision 726129. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/console This message is automatically generated.
          Hide
          stack added a comment -

          +1 Test now passes locally with v4. Took quick look at patch. Looks good to me.

          Show
          stack added a comment - +1 Test now passes locally with v4. Took quick look at patch. Looks good to me.
          Hide
          Andrzej Bialecki added a comment -

          Apart from fumbling the args passed to the constructor, the unit test failed for a valid reason - the calculation of BloomFilter size was wrong. I fixed the calculation and added a config knob to adjust the error rate.

          This time it passes the tests and gets +1's from ant test-patch.

          Show
          Andrzej Bialecki added a comment - Apart from fumbling the args passed to the constructor, the unit test failed for a valid reason - the calculation of BloomFilter size was wrong. I fixed the calculation and added a config knob to adjust the error rate. This time it passes the tests and gets +1's from ant test-patch.
          Hide
          stack added a comment -

          Reviewing the patch, it looks great. The above failure would seem to be because the last two arguments need to be flipped:

           236 +      bloomFilter = new DynamicBloomFilter(vectorSize, HASH_COUNT, numKeys,
           237 +          Hash.getHashType(conf));
          

          ....the DBF constructor looks like this:

          1528 +  public DynamicBloomFilter(int vectorSize, int nbHash, int hashType, int nr) {
          

          Test still fails with an assertion error so maybe the above is not right.

          There is an hbasism still in the code that you probably want to remove:

          408 +    String name = conf.get("hbase.hash.type", "murmur");
          
          Show
          stack added a comment - Reviewing the patch, it looks great. The above failure would seem to be because the last two arguments need to be flipped: 236 + bloomFilter = new DynamicBloomFilter(vectorSize, HASH_COUNT, numKeys, 237 + Hash.getHashType(conf)); ....the DBF constructor looks like this: 1528 + public DynamicBloomFilter( int vectorSize, int nbHash, int hashType, int nr) { Test still fails with an assertion error so maybe the above is not right. There is an hbasism still in the code that you probably want to remove: 408 + String name = conf.get( "hbase.hash.type" , "murmur" );
          Hide
          stack added a comment -

          Andrzej: It failed its own unit test up on hudson org.apache.hadoop.io.TestBloomMapFile.testMembershipTest.

          Error Message
          
          hashType must be known
          
          Stacktrace
          
          java.lang.IllegalArgumentException: hashType must be known
          	at org.apache.hadoop.util.bloom.HashFunction.<init>(HashFunction.java:92)
          	at org.apache.hadoop.util.bloom.Filter.<init>(Filter.java:102)
          	at org.apache.hadoop.util.bloom.DynamicBloomFilter.<init>(DynamicBloomFilter.java:121)
          	at org.apache.hadoop.io.BloomMapFile$Writer.initBloomFilter(BloomMapFile.java:150)
          	at org.apache.hadoop.io.BloomMapFile$Writer.<init>(BloomMapFile.java:143)
          	at org.apache.hadoop.io.TestBloomMapFile.testMembershipTest(TestBloomMapFile.java:37)
          

          I ran it locally and got same failure.

          Show
          stack added a comment - Andrzej: It failed its own unit test up on hudson org.apache.hadoop.io.TestBloomMapFile.testMembershipTest. Error Message hashType must be known Stacktrace java.lang.IllegalArgumentException: hashType must be known at org.apache.hadoop.util.bloom.HashFunction.<init>(HashFunction.java:92) at org.apache.hadoop.util.bloom.Filter.<init>(Filter.java:102) at org.apache.hadoop.util.bloom.DynamicBloomFilter.<init>(DynamicBloomFilter.java:121) at org.apache.hadoop.io.BloomMapFile$Writer.initBloomFilter(BloomMapFile.java:150) at org.apache.hadoop.io.BloomMapFile$Writer.<init>(BloomMapFile.java:143) at org.apache.hadoop.io.TestBloomMapFile.testMembershipTest(TestBloomMapFile.java:37) I ran it locally and got same failure.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12395548/bloommap-v3.patch
          against trunk revision 725729.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12395548/bloommap-v3.patch against trunk revision 725729. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/console This message is automatically generated.
          Hide
          Andrzej Bialecki added a comment -

          Any chance that this issue makes it into 0.20 ?

          Show
          Andrzej Bialecki added a comment - Any chance that this issue makes it into 0.20 ?
          Hide
          Andrzej Bialecki added a comment -

          Updated patch.

          Show
          Andrzej Bialecki added a comment - Updated patch.
          Hide
          Andrzej Bialecki added a comment -

          This patch uses the latest versions of Bloom filters and Hash implementations from HBase. It successfully passes the 'ant test-patch' command:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 4 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          Show
          Andrzej Bialecki added a comment - This patch uses the latest versions of Bloom filters and Hash implementations from HBase. It successfully passes the 'ant test-patch' command: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 4 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          Hide
          Edward Bruce Williams added a comment -

          Thanks, refactoring of the onelab stuff into a hadoop util.bloomfilters would cleanup some awkward code tangles in hbase.

          Show
          Edward Bruce Williams added a comment - Thanks, refactoring of the onelab stuff into a hadoop util.bloomfilters would cleanup some awkward code tangles in hbase.
          Hide
          Andrzej Bialecki added a comment -

          Sorry, I got involved in other stuff - I'd like to prepare a new patch in the next day or two, perhaps we could squeeze it into 0.20 ... For the purpose of this patch I'll take the latest BloomFilter-s as they are now in HBase (plus the MurmurHash).

          Show
          Andrzej Bialecki added a comment - Sorry, I got involved in other stuff - I'd like to prepare a new patch in the next day or two, perhaps we could squeeze it into 0.20 ... For the purpose of this patch I'll take the latest BloomFilter-s as they are now in HBase (plus the MurmurHash).
          Hide
          Edward Bruce Williams added a comment -

          What is the status on this? I was assigned to fix some bugs in the Hbase Bloom Filter code, but moving the code back into Hadoop and removing it from HBase needs to be done first. Can I consider HBase 533 unblocked? There is also a new hashing mechanism now that was added to hbase by Andrzej, MurmurHash, that should also be moved back up into hadoop.

          Show
          Edward Bruce Williams added a comment - What is the status on this? I was assigned to fix some bugs in the Hbase Bloom Filter code, but moving the code back into Hadoop and removing it from HBase needs to be done first. Can I consider HBase 533 unblocked? There is also a new hashing mechanism now that was added to hbase by Andrzej, MurmurHash, that should also be moved back up into hadoop.
          Hide
          Owen O'Malley added a comment -

          It is complaining that your clone methods don't call the super.clone, which is generally considered good style. (So in your case, Filter needs a clone method that copies its fields that is called by the subtypes.)

          Yes, please add synchronization. It is far easier to add synchronization to a method that deal with missing synchronization.

          You also need to address the javac and javadoc warnings.

          Show
          Owen O'Malley added a comment - It is complaining that your clone methods don't call the super.clone, which is generally considered good style. (So in your case, Filter needs a clone method that copies its fields that is called by the subtypes.) Yes, please add synchronization. It is far easier to add synchronization to a method that deal with missing synchronization. You also need to address the javac and javadoc warnings.
          Hide
          Andrzej Bialecki added a comment -

          I'm not sure what to do about the Findbugs warnings. There are two of them, and both don't make sense to me.

          • CN_IDIOM_NO_SUPER_CALL: clone method does not call super.clone() in BloomFilter.clone() and in DynamicBloomFilter.clone() - how am I supposed to use Object.clone() here?
          • IS2_INCONSISTENT_SYNC: Inconsistent synchronization in BloomMapFile.initBloomFilter() - this access is not synchronized because it's called from the constructor. Other accesses are synchronized because the MapFile methods require synchronization. Should I add "synchronized" to initBloomFilter(), even though it's not needed there?
          Show
          Andrzej Bialecki added a comment - I'm not sure what to do about the Findbugs warnings. There are two of them, and both don't make sense to me. CN_IDIOM_NO_SUPER_CALL: clone method does not call super.clone() in BloomFilter.clone() and in DynamicBloomFilter.clone() - how am I supposed to use Object.clone() here? IS2_INCONSISTENT_SYNC: Inconsistent synchronization in BloomMapFile.initBloomFilter() - this access is not synchronized because it's called from the constructor. Other accesses are synchronized because the MapFile methods require synchronization. Should I add "synchronized" to initBloomFilter(), even though it's not needed there?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378873/bloommap-v2.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 4 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac -1. The applied patch generated 579 javac compiler warnings (more than the trunk's current 568 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12378873/bloommap-v2.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 4 new or modified tests. javadoc -1. The javadoc tool appears to have generated 1 warning messages. javac -1. The applied patch generated 579 javac compiler warnings (more than the trunk's current 568 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 3 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/console This message is automatically generated.
          Hide
          Andrzej Bialecki added a comment -

          Updated patch. This patch imports the Bloom filter classes into org.apache.hadoop.util.bloom, and adds a notice to LICENSE.txt.

          Show
          Andrzej Bialecki added a comment - Updated patch. This patch imports the Bloom filter classes into org.apache.hadoop.util.bloom, and adds a notice to LICENSE.txt.
          Hide
          Doug Cutting added a comment -

          > I think it would be best to import these classes into org.apache.hadoop.util.bloom package [ ...]

          +1

          The license should also be added to our top-level LICENSE.txt, in the style of

          http://svn.apache.org/repos/asf/httpd/httpd/trunk/LICENSE

          Show
          Doug Cutting added a comment - > I think it would be best to import these classes into org.apache.hadoop.util.bloom package [ ...] +1 The license should also be added to our top-level LICENSE.txt, in the style of http://svn.apache.org/repos/asf/httpd/httpd/trunk/LICENSE
          Hide
          Jim Kellerman added a comment -

          We can't import them as a Jar because they implement Writable

          Show
          Jim Kellerman added a comment - We can't import them as a Jar because they implement Writable
          Hide
          Andrzej Bialecki added a comment -

          Re: replicating the source file - I agree, I just wasn't sure how to solve this.

          My experience with these classes is that they are not well-debugged, so often bug fixing is necessary. The original creator was an EU IST project, so it's unlikely we could expect any maintenance from that project. Therefore I don't think that packaging them as a third-party jar is a good option.

          I think it would be best to import these classes into org.apache.hadoop.util.bloom package, and keep the comment about the original authors in the javadoc, as we do it now, and subsequently remove these classes form HBase.

          Show
          Andrzej Bialecki added a comment - Re: replicating the source file - I agree, I just wasn't sure how to solve this. My experience with these classes is that they are not well-debugged, so often bug fixing is necessary. The original creator was an EU IST project, so it's unlikely we could expect any maintenance from that project. Therefore I don't think that packaging them as a third-party jar is a good option. I think it would be best to import these classes into org.apache.hadoop.util.bloom package, and keep the comment about the original authors in the javadoc, as we do it now, and subsequently remove these classes form HBase.
          Hide
          Doug Cutting added a comment -

          We should avoid replicating source files. So files are copied from HBase to Core, then they should ideally then be removed from HBase, since HBase relies on Core.

          As for the org.onelab classes: shouldn't we import these as a jar rather than as source?

          Show
          Doug Cutting added a comment - We should avoid replicating source files. So files are copied from HBase to Core, then they should ideally then be removed from HBase, since HBase relies on Core. As for the org.onelab classes: shouldn't we import these as a jar rather than as source?
          Hide
          Jim Kellerman added a comment -

          If you have fixes for BloomFilter and DynamicBloomFilter, please open a Jira for HBase https://issues.apache.org/jira/browse/HBASE and submit a patch there as well. Thanks.

          Show
          Jim Kellerman added a comment - If you have fixes for BloomFilter and DynamicBloomFilter, please open a Jira for HBase https://issues.apache.org/jira/browse/HBASE and submit a patch there as well. Thanks.
          Hide
          Andrzej Bialecki added a comment -

          BloomMapFile implementation and JUnit test.

          NOTE 1: I wasn't sure how to approach the issue of the org.onelab.* classes that I borrowed from HBase (which originally were a part of Hadoop core ). For now they are included verbatim here in this patch.

          NOTE 2: the BloomFilter and DynamicBloomFilter classes contained a few bugs related to their Writable (de)serialization, some of them related to specific assumptions about the environment, some others fatal under all conditions. This patch contains these fixes too.

          Show
          Andrzej Bialecki added a comment - BloomMapFile implementation and JUnit test. NOTE 1: I wasn't sure how to approach the issue of the org.onelab.* classes that I borrowed from HBase (which originally were a part of Hadoop core ). For now they are included verbatim here in this patch. NOTE 2: the BloomFilter and DynamicBloomFilter classes contained a few bugs related to their Writable (de)serialization, some of them related to specific assumptions about the environment, some others fatal under all conditions. This patch contains these fixes too.

            People

            • Assignee:
              Andrzej Bialecki
              Reporter:
              Andrzej Bialecki
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development