Issue Details (XML | Word | Printable)

Key: HADOOP-3063
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Andrzej Bialecki
Reporter: Andrzej Bialecki
Votes: 0
Watchers: 8
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

BloomMapFile - fail-fast version of MapFile for sparsely populated key space

Created: 21/Mar/08 11:03 AM   Updated: 23/Apr/09 07:17 PM
Return to search
Component/s: io
Affects Version/s: 0.20.0
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works bloommap-v2.patch 2008-03-29 09:50 PM Andrzej Bialecki 77 kB
Text File Licensed for inclusion in ASF works bloommap-v3.patch 2008-12-08 12:21 PM Andrzej Bialecki 96 kB
Text File Licensed for inclusion in ASF works bloommap-v4.patch 2008-12-13 03:52 AM Andrzej Bialecki 101 kB
Text File Licensed for inclusion in ASF works bloommap.patch 2008-03-21 11:09 AM Andrzej Bialecki 87 kB
Issue Links:
Dependants
 
Reference
 

Hadoop Flags: Reviewed
Release Note: Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys.
Resolution Date: 15/Dec/08 09:00 PM


 Description  « Hide
The need for this improvement arose when working with large ancillary MapFile-s (essentially used as external dictionaries). For each invokation of map() / reduce() it was necessary to perform several look-ups in these MapFile-s, and in case of sparsely populated key-space the cost of finding that a key is absent was too high.

This patch implements a subclass of MapFile that creates a Bloom filter from all keys, so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.

Writer.append() operations update a DynamicBloomFilter, which is then serialized when the Writer is closed. This filter is loaded in memory when a Reader is created. Reader.get() operation first checks the filter for the key membership, and if the key is absent it immediately returns null without doing any further IO.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Andrzej Bialecki added a comment - 21/Mar/08 11:09 AM
BloomMapFile implementation and JUnit test.

NOTE 1: I wasn't sure how to approach the issue of the org.onelab.* classes that I borrowed from HBase (which originally were a part of Hadoop core ). For now they are included verbatim here in this patch.

NOTE 2: the BloomFilter and DynamicBloomFilter classes contained a few bugs related to their Writable (de)serialization, some of them related to specific assumptions about the environment, some others fatal under all conditions. This patch contains these fixes too.


Jim Kellerman added a comment - 21/Mar/08 04:17 PM
If you have fixes for BloomFilter and DynamicBloomFilter, please open a Jira for HBase https://issues.apache.org/jira/browse/HBASE and submit a patch there as well. Thanks.

Doug Cutting added a comment - 21/Mar/08 04:32 PM
We should avoid replicating source files. So files are copied from HBase to Core, then they should ideally then be removed from HBase, since HBase relies on Core.

As for the org.onelab classes: shouldn't we import these as a jar rather than as source?


Andrzej Bialecki added a comment - 21/Mar/08 04:41 PM
Re: replicating the source file - I agree, I just wasn't sure how to solve this.

My experience with these classes is that they are not well-debugged, so often bug fixing is necessary. The original creator was an EU IST project, so it's unlikely we could expect any maintenance from that project. Therefore I don't think that packaging them as a third-party jar is a good option.

I think it would be best to import these classes into org.apache.hadoop.util.bloom package, and keep the comment about the original authors in the javadoc, as we do it now, and subsequently remove these classes form HBase.


Jim Kellerman added a comment - 21/Mar/08 05:01 PM
We can't import them as a Jar because they implement Writable

Doug Cutting added a comment - 21/Mar/08 05:13 PM
> I think it would be best to import these classes into org.apache.hadoop.util.bloom package [ ...]

+1

The license should also be added to our top-level LICENSE.txt, in the style of

http://svn.apache.org/repos/asf/httpd/httpd/trunk/LICENSE


Andrzej Bialecki added a comment - 29/Mar/08 09:50 PM
Updated patch. This patch imports the Bloom filter classes into org.apache.hadoop.util.bloom, and adds a notice to LICENSE.txt.

Hadoop QA added a comment - 29/Mar/08 11:04 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378873/bloommap-v2.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 4 new or modified tests.

javadoc -1. The javadoc tool appears to have generated 1 warning messages.

javac -1. The applied patch generated 579 javac compiler warnings (more than the trunk's current 568 warnings).

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/console

This message is automatically generated.


Andrzej Bialecki added a comment - 10/Apr/08 04:50 PM
I'm not sure what to do about the Findbugs warnings. There are two of them, and both don't make sense to me.
  • CN_IDIOM_NO_SUPER_CALL: clone method does not call super.clone() in BloomFilter.clone() and in DynamicBloomFilter.clone() - how am I supposed to use Object.clone() here?
  • IS2_INCONSISTENT_SYNC: Inconsistent synchronization in BloomMapFile.initBloomFilter() - this access is not synchronized because it's called from the constructor. Other accesses are synchronized because the MapFile methods require synchronization. Should I add "synchronized" to initBloomFilter(), even though it's not needed there?

Owen O'Malley added a comment - 22/Apr/08 09:32 PM
It is complaining that your clone methods don't call the super.clone, which is generally considered good style. (So in your case, Filter needs a clone method that copies its fields that is called by the subtypes.)

Yes, please add synchronization. It is far easier to add synchronization to a method that deal with missing synchronization.

You also need to address the javac and javadoc warnings.


Edward Bruce Williams added a comment - 02/Dec/08 10:21 PM
What is the status on this? I was assigned to fix some bugs in the Hbase Bloom Filter code, but moving the code back into Hadoop and removing it from HBase needs to be done first. Can I consider HBase 533 unblocked? There is also a new hashing mechanism now that was added to hbase by Andrzej, MurmurHash, that should also be moved back up into hadoop.

Andrzej Bialecki added a comment - 02/Dec/08 10:32 PM
Sorry, I got involved in other stuff - I'd like to prepare a new patch in the next day or two, perhaps we could squeeze it into 0.20 ... For the purpose of this patch I'll take the latest BloomFilter-s as they are now in HBase (plus the MurmurHash).

Edward Bruce Williams added a comment - 02/Dec/08 10:55 PM
Thanks, refactoring of the onelab stuff into a hadoop util.bloomfilters would cleanup some awkward code tangles in hbase.

Andrzej Bialecki added a comment - 08/Dec/08 12:21 PM
This patch uses the latest versions of Bloom filters and Hash implementations from HBase. It successfully passes the 'ant test-patch' command:

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 4 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.


Andrzej Bialecki added a comment - 09/Dec/08 09:14 AM
Updated patch.

Andrzej Bialecki added a comment - 10/Dec/08 02:23 PM
Any chance that this issue makes it into 0.20 ?

Hadoop QA added a comment - 11/Dec/08 08:35 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12395548/bloommap-v3.patch
against trunk revision 725729.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/console

This message is automatically generated.


stack added a comment - 11/Dec/08 09:24 PM
Andrzej: It failed its own unit test up on hudson org.apache.hadoop.io.TestBloomMapFile.testMembershipTest.
Error Message

hashType must be known

Stacktrace

java.lang.IllegalArgumentException: hashType must be known
	at org.apache.hadoop.util.bloom.HashFunction.<init>(HashFunction.java:92)
	at org.apache.hadoop.util.bloom.Filter.<init>(Filter.java:102)
	at org.apache.hadoop.util.bloom.DynamicBloomFilter.<init>(DynamicBloomFilter.java:121)
	at org.apache.hadoop.io.BloomMapFile$Writer.initBloomFilter(BloomMapFile.java:150)
	at org.apache.hadoop.io.BloomMapFile$Writer.<init>(BloomMapFile.java:143)
	at org.apache.hadoop.io.TestBloomMapFile.testMembershipTest(TestBloomMapFile.java:37)

I ran it locally and got same failure.


stack added a comment - 11/Dec/08 10:34 PM
Reviewing the patch, it looks great. The above failure would seem to be because the last two arguments need to be flipped:
236 +      bloomFilter = new DynamicBloomFilter(vectorSize, HASH_COUNT, numKeys,
 237 +          Hash.getHashType(conf));

....the DBF constructor looks like this:

1528 +  public DynamicBloomFilter(int vectorSize, int nbHash, int hashType, int nr) {

Test still fails with an assertion error so maybe the above is not right.

There is an hbasism still in the code that you probably want to remove:

408 +    String name = conf.get("hbase.hash.type", "murmur");

Andrzej Bialecki added a comment - 13/Dec/08 03:52 AM
Apart from fumbling the args passed to the constructor, the unit test failed for a valid reason - the calculation of BloomFilter size was wrong. I fixed the calculation and added a config knob to adjust the error rate.

This time it passes the tests and gets +1's from ant test-patch.


stack added a comment - 13/Dec/08 09:09 PM
+1 Test now passes locally with v4. Took quick look at patch. Looks good to me.

Hadoop QA added a comment - 14/Dec/08 12:25 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12395994/bloommap-v4.patch
against trunk revision 726129.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/console

This message is automatically generated.


stack added a comment - 14/Dec/08 03:41 AM
The failing tests seem unrelated to this patch:
org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestFileTailingAdaptors.testLogRotate
org.apache.hadoop.hdfs.TestFileAppend2.testComplexAppend

IMO, these failures shouldn't hold up committing this patch.


stack added a comment - 15/Dec/08 09:00 PM
Committed on behalf of Andrzej. He's swamped today and he (and I) wanted to get this in before 0.20.0 feature-freeze.

Robert Chansler added a comment - 03/Mar/09 07:08 PM
Edit release note for publication.