- A bloom filter is a read-only probabilistic set. It's useful
for checking whether a key exists in a set: lookups can return
false positives, but never false negatives.
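To make the semantics concrete, here is a minimal self-contained sketch of a bloom filter; it is not the Hadoop BloomFilter the patch actually uses, and the class name and hashing scheme are invented for illustration.

```java
import java.util.BitSet;

// Minimal bloom filter sketch (illustrative only, not Hadoop's implementation):
// k hash functions set/check k bits in a fixed-size bit array.
public class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit index from the key via double hashing.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(index(key, i));
    }

    // false => definitely absent; true => possibly present (false positives allowed)
    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) return false;
        }
        return true;
    }
}
```

A key that was added is always reported present; the false-positive rate depends on the bit-array size and the number of hash functions, which is why those options need to be tunable.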
- The use case is indexing in Hadoop and checking for duplicates
against a Solr cluster; doing that check with term dictionary
lookups or queries is too slow and takes longer than the
indexing itself.
When a match is found, the host, segment, and term are returned.
If the same term is found on multiple servers, the distributed
process returns multiple results. (I just realized we'll also
need to include the core name.)
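A single match entry in the distributed response might be shaped roughly like this; the field names here are hypothetical, not taken from the patch, and "core" is the field noted above as still missing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative shape of one match entry in the distributed response.
// Keys are invented for this sketch, not copied from the patch.
public class BloomMatch {
    public static Map<String, String> match(String host, String core,
                                            String segment, String term) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("host", host);       // which Solr server answered
        m.put("core", core);       // core name (noted above as still to be added)
        m.put("segment", segment); // segment whose bloom filter matched
        m.put("term", term);       // the term that was found
        return m;
    }
}
```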
- When new segments are created and commit is called, a new
bloom filter is generated from a given field (default: id) by
iterating over the term dictionary values. There is one bloom
filter file per segment, managed on each Solr shard. When
segments are merged away, their corresponding .blm files are
also removed. In a future version we'll have a central server
for the bloom filters so we're not abusing the thread pool of
the Solr proxy and the networking of the Solr cluster (this will
be done sooner rather than later, after testing this version). I
held off because the central server requires syncing the Solr
servers' files (which is like replication in reverse).
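The per-segment build step above could be sketched like this. It is a simplification under stated assumptions: the patch hooks Solr's commit and walks Lucene's real term dictionary with Hadoop's BloomFilter, whereas here a plain string iterator stands in for the terms enum, the filter is a bare BitSet, and all names and sizes are illustrative.

```java
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;

// Sketch of the per-segment build: iterate a segment's term dictionary
// for one field (default: id) and set bits for each term. The resulting
// bit set is what would be written out as the segment's .blm file.
public class SegmentBloomBuilder {
    static final int SIZE = 1 << 16; // illustrative filter size
    static final int HASHES = 3;     // illustrative hash count

    static int index(String term, int i) {
        int h1 = term.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd second hash for double hashing
        return Math.floorMod(h1 + i * h2, SIZE);
    }

    // Called once per segment at commit time.
    public static BitSet build(Iterator<String> termDict) {
        BitSet bits = new BitSet(SIZE);
        while (termDict.hasNext()) {
            String term = termDict.next();
            for (int i = 0; i < HASHES; i++) bits.set(index(term, i));
        }
        return bits;
    }

    // Query side: check a term against one segment's filter.
    public static boolean mightContain(BitSet bits, String term) {
        for (int i = 0; i < HASHES; i++) {
            if (!bits.get(index(term, i))) return false;
        }
        return true;
    }
}
```

When a segment is merged away, its bit set (the .blm file) is simply deleted; the merged segment gets a fresh filter built the same way at the next commit.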
- The patch uses the BloomFilter from Hadoop 0.20. I want to jar
up only the necessary classes so we don't have a giant Hadoop
jar in lib.
- Distributed code is added and seems to work; I extended
TestDistributedSearch to test over multiple HTTP servers. I
chose this approach rather than the manual method used by (for
example) TermVectorComponent.testDistributed because I'm new to
Solr's distributed search and wanted to learn how it works (the
stages are confusing). Using this method, I didn't need to set
up multiple Tomcat servers and manually execute tests.
- We need more of the bloom filter options passable via
- I'll add more test cases