Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 4.9, 5.0
    • Component/s: update
    • Labels:
      None

      Description

      • The use case is indexing in Hadoop and checking for duplicates
        against a Solr cluster; doing that check through the term
        dictionary or a query is too slow and exceeds the time consumed
        for indexing. When a match is found, the host, segment, and term
        are returned. If the same term is found on multiple servers,
        multiple results are returned by the distributed process. (We'll
        also need to add the core name, I just realized.)
      • When new segments are created and commit is called, a new
        bloom filter is generated from a given field (default: id) by
        iterating over the term dictionary values (see the sketch after
        this list). There's a bloom filter file per segment, which is
        managed on each Solr shard. When segments are merged away, their
        corresponding .blm files are also removed. In a future version
        we'll have a central server for the bloom filters so we're not
        abusing the thread pool of the Solr proxy and the networking of
        the Solr cluster (this will be done sooner rather than later,
        after testing this version). I held off because the central
        server requires syncing the Solr servers' files (which is like
        reverse replication).
      • Distributed code is added and seems to work; I extended
        TestDistributedSearch to test over multiple HTTP servers. I
        chose this approach rather than the manual method used by (for
        example) TermVectorComponent.testDistributed because I'm new to
        Solr's distributed search and wanted to learn how it works (the
        stages are confusing). Using this method, I didn't need to set
        up multiple Tomcat servers and manually execute tests.
      • More of the bloom filter options need to be passable via
        solrconfig.
      • I'll add more test cases.
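
      Below is a minimal sketch (not the patch itself) of the per-segment
      filter construction described above, assuming the Hadoop BloomFilter
      classes bundled with the patch and the Lucene TermEnum API of that
      era; the class name and sizing parameters are illustrative.

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

/** Builds a per-segment Bloom filter from the term dictionary of the key field. */
public class SegmentBloomFilterBuilder {

  // Illustrative sizing; real values would come from solrconfig.
  private static final int VECTOR_SIZE = 1 << 20; // bits per segment filter
  private static final int NUM_HASH = 3;

  public static BloomFilter build(IndexReader segmentReader, String keyField)
      throws Exception {
    BloomFilter filter = new BloomFilter(VECTOR_SIZE, NUM_HASH, Hash.MURMUR_HASH);
    TermEnum terms = segmentReader.terms(new Term(keyField, ""));
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(keyField)) {
          break; // walked past the key field
        }
        filter.add(new Key(t.text().getBytes("UTF-8")));
      } while (terms.next());
    } finally {
      terms.close();
    }
    // The filter would then be written next to the segment as a ".blm" file
    // (BloomFilter implements Writable) and dropped when the segment is merged away.
    return filter;
  }
}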
      Attachments

      1. SOLR-1375.patch
        49 kB
        Jason Rutherglen
      2. SOLR-1375.patch
        132 kB
        Jason Rutherglen
      3. SOLR-1375.patch
        133 kB
        Jason Rutherglen
      4. SOLR-1375.patch
        133 kB
        Jason Rutherglen
      5. SOLR-1375.patch
        134 kB
        Jason Rutherglen

        Issue Links

          Activity

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Isaac Hebsh added a comment -

          Thanks Shawn Heisey. You are probably right. Maybe the wiki page, which links to here, should be marked properly...

          Shawn Heisey added a comment -

          Isaac Hebsh Lucene has bloomfilter functionality, but it isn't a complete postings format - it wraps another format. To get it into Solr, a new class needs to be created.

          It's on my todo list via SOLR-3950. This issue is about including Hadoop's functionality in Solr, which I am pretty sure we won't be doing. I think this needs to be closed as a duplicate of SOLR-3950, even though it is an older issue.
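
          A rough sketch of the kind of wrapper class described above,
          assuming Lucene 4.x codec APIs; the class name, the Lucene41
          delegate, and the SPI name are illustrative, and a real class
          would also need META-INF/services registration before a Solr
          schema could select it via postingsFormat.

import java.io.IOException;

import org.apache.lucene.codecs.FieldsConsumer;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

/** Wraps a delegate postings format with Lucene's bloom filtering support. */
public class BloomLucene41PostingsFormat extends PostingsFormat {

  private final PostingsFormat delegate =
      new BloomFilteringPostingsFormat(PostingsFormat.forName("Lucene41"));

  public BloomLucene41PostingsFormat() {
    super("BloomLucene41"); // SPI name a schema could reference
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return delegate.fieldsConsumer(state); // bloom wrapper writes, delegate stores postings
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    return delegate.fieldsProducer(state);
  }
}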

          Isaac Hebsh added a comment -

          Hi, I know this issue has not been active for a long while, but I could not understand why. What work should be done in order to get this issue resolved?

          Mark Harwood added a comment -

          4069 is an implementation of Otis' suggestion - core Lucene maintaining a Bloom filter for terms in a segment.

          Mark Harwood added a comment -

          FYI, I came across this issue while exploring a different use case but adopting a similar implementation. I thought it might be useful to share some numbers:

          I found I could reduce primary key lookup costs significantly if each segment-level reader cached a Bloom filter to avoid searching segments that showed no trace of the key in the accompanying filter.
          On a 1.5 GB index with 20 segments, 1,000 key lookups took 6 seconds with segment-level Bloom filters, compared to 44 seconds unfiltered.
          Each segment had a Bloom filter represented by an OpenBitSet sized for 1 million bits and constructed from a TermEnum walk of the terms, which were then hashed with MurmurHash. This cost only ~2 MB of RAM (20 segments x 1M bits). One advantage of using a fixed-size bitset is that these could happily be ORed together if we could only hook into Lucene's segment merge process - this would avoid the cost of another TermEnum walk-and-hash run when opening newly merged segments. Of course, if these bitsets are serialized as ".blm" files as part of core Lucene, then merge operations would maintain them in the background and always ensure faster startup from cold.
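
          A minimal sketch of the guard described above (the class name
          and the hash are stand-ins, not the actual code used in the
          experiment): consult the segment's fixed-size filter first and
          skip the primary-key lookup when the key cannot be present.

import org.apache.lucene.util.OpenBitSet;

/** Per-segment membership filter used to skip key lookups against that segment. */
public class SegmentKeyFilter {

  private static final long NUM_BITS = 1L << 20; // ~1 million bits per segment
  private final OpenBitSet bits = new OpenBitSet(NUM_BITS);

  /** Illustrative stand-in for the MurmurHash used in the experiment. */
  private static long hash(String key, int seed) {
    long h = seed * 0x9E3779B97F4A7C15L;
    for (int i = 0; i < key.length(); i++) {
      h = (h ^ key.charAt(i)) * 0x100000001B3L;
    }
    return h & Long.MAX_VALUE; // keep it non-negative for the modulo below
  }

  public void add(String key) {
    for (int seed = 1; seed <= 3; seed++) {
      bits.set(hash(key, seed) % NUM_BITS);
    }
  }

  /** False means the key is definitely not in this segment; true means "maybe". */
  public boolean mightContain(String key) {
    for (int seed = 1; seed <= 3; seed++) {
      if (!bits.get(hash(key, seed) % NUM_BITS)) {
        return false;
      }
    }
    return true;
  }
}

          Because every segment uses the same fixed size, two such bitsets
          could be ORed together during a segment merge, as the comment
          notes.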

          Hoss Man added a comment -

          Bulk move of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          Email notification suppressed to prevent mass-spam.
          Pseudo-unique token identifying these issues: hoss20120321nofix36

          Robert Muir added a comment -

          3.4 -> 3.5

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Jason Rutherglen added a comment -

          Doesn't this hint at some of this stuff (haven't looked at the patch) really needing to live in Lucene index segment files merging land?

          Adding this to Lucene is out of the scope of what I require, and I don't have time to do it unless it's going to be committed.

          Ted Dunning added a comment -

          Sorry to comment late here, but when indexing in Hadoop, it is really nice to avoid any central dependency. It is also nice to focus the map-side join on items likely to match. Thirdly, reduce-side indexing is typically really important.

          The conclusions from these three considerations vary by duplication rate. Using reduce-side indexing gets rid of most of the problems of duplicate versions of a single document (with the same sort key) since the reducer can scan to see whether it has the final copy handy before adding a document to the index.

          There remain problems where we have to avoid indexing documents that already exist in the index, or generate a deletion list that can assist in applying the index update. The former problem is usually the more severe one, because it isn't unusual for data sources to just include a full dump of all documents and assume that the consumer will figure out which are new or updated. Here you would like to index only new and modified documents.

          My own preference for this is to avoid the complication of the map-side join using Bloom filters and simply export a very simple list of stub documents that correspond to the documents in the index. These stub documents should be much smaller than the average document (unless you are indexing tweets), which makes passing around great masses of stub documents not such a problem, since Hadoop shuffle, copy, and sort times are all dominated by Lucene index times. Passing stub documents allows the reducer to simply iterate through all documents with the same key, keeping the latest version or any stub that is encountered. For documents without a stub, normal indexing can be done, with the slight addition of exporting a list of stub documents for the new additions.

          The same thing could be done with a map-side join, but the trade-off is that you now need considerably more memory for the mapper to store the entire bitmap in memory, as opposed to needing (somewhat) more time to pass the stub documents around. How that trade-off plays out in the real world isn't clear. My personal preference is to keep heap space small, since the time cost is pretty minimal for me.

          This problem also turns up in our PDF conversion pipeline, where we keep checksums of each PDF that has already been converted to viewable forms. In that case, the ratio of real document size to stub size is even more pronounced.
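
          A hypothetical sketch of the reduce-side dedup described above
          (the STUB marker and the value encoding are inventions for
          illustration, not Ted's actual pipeline): documents and stubs are
          shuffled on the document key; a stub means the index already has
          the key, otherwise the latest version goes on to indexing.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Keeps the latest version of each document unless a stub says it is already indexed. */
public class StubDedupReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    boolean alreadyIndexed = false;
    String latest = null;
    for (Text value : values) {
      String v = value.toString();
      if (v.equals("STUB")) {
        alreadyIndexed = true;            // a stub means the index already has this key
      } else if (latest == null || v.compareTo(latest) > 0) {
        latest = v;                       // keep the latest version (ordering is illustrative)
      }
    }
    if (!alreadyIndexed && latest != null) {
      // Emit for indexing; a new stub for this key would also be exported.
      context.write(key, new Text(latest));
    }
  }
}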

          Otis Gospodnetic added a comment -

          Heh, with the Lucene/Solr merge taking place now, my previous comment above makes even more sense. What do you think?

          Otis Gospodnetic added a comment -

          When new segments are created, and commit is called, a new
          bloom filter is generated from a given field (default:id) by
          iterating over the term dictionary values. There's a bloom
          filter file per segment, which is managed on each Solr shard.
          When segments are merged away, their corresponding .blm files are
          also removed.

          Doesn't this hint at some of this stuff (haven't looked at the patch) really needing to live in Lucene index segment files merging land?

          Jason Rutherglen added a comment -
          • Removes the need to set the shards.qt in solrconfig.xml.
          • The test case verifies the correct .blm files are created
          Jason Rutherglen added a comment -

          The other attribute to add is the ability to set the hash function to use (e.g., Murmur).

          Jason Rutherglen added a comment -

          Patch which allows shards to be set in defaults, and which does not cause an infinite loop if the shard is the same as the host.

          Andrzej Bialecki added a comment -

          See here for a Java impl. of FastBits: http://code.google.com/p/compressedbitset/ .

          Re: BloomFilters - in BloomIndexComponent you seem to assume that when BloomKeySet.contains(key) returns true, the key exists in the set. Strictly speaking, this is not true. You can only be sure with probability 1.0 that a key does NOT exist in the set; for other keys, when the result is true, you only have a (1.0 - eps) probability that the answer is correct, i.e. the BloomFilter will return a false positive result for non-existent keys with probability eps. You should take this into account when writing client code.
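
          A small sketch of the client-side interpretation this implies,
          using the Hadoop BloomFilter API included in the patch
          (verifyInIndex is a hypothetical placeholder for a real term
          lookup or query):

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

/** Treats a positive Bloom test as "maybe", never as "definitely present". */
public class BloomCheckExample {

  public static boolean exists(BloomFilter filter, String id) {
    if (!filter.membershipTest(new Key(id.getBytes()))) {
      return false;            // definitely not present; safe to skip the lookup
    }
    // "true" only means "maybe present" (false positive rate eps),
    // so confirm with an authoritative check against the index.
    return verifyInIndex(id);
  }

  private static boolean verifyInIndex(String id) {
    // placeholder for a real term-dictionary or query lookup
    return true;
  }
}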

          Jason Rutherglen added a comment -
          • Bug fixes
          • Core name included in response
          • Wiki is located at http://wiki.apache.org/solr/BloomIndexComponent
          Jason Rutherglen added a comment -
          • The Hadoop BloomFilter code is included in the patch
          Jason Rutherglen added a comment -

          Lance,

          Thanks for the info. BloomFilters can be ORed together. Hadoop
          uses BFs for map-side joins, which is similar to this use case.

          Centralizing will help when performing millions of id membership
          tests, though I'm going to benchmark first and see if the
          current patch is good enough.

          -J
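
          A small sketch of the OR merge mentioned above, assuming the
          Hadoop BloomFilter implementation; both filters must be built
          with the same vector size, hash count, and hash type (the sizing
          here is illustrative):

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.hash.Hash;

/** Merges two identically sized Bloom filters into the union of their key sets. */
public class BloomFilterMerge {

  public static BloomFilter merge(BloomFilter a, BloomFilter b) {
    // Both filters must share vectorSize, nbHash, and hashType.
    a.or(b);   // a now represents the union of both key sets
    return a;
  }

  public static void main(String[] args) {
    BloomFilter shardA = new BloomFilter(1 << 20, 3, Hash.MURMUR_HASH);
    BloomFilter shardB = new BloomFilter(1 << 20, 3, Hash.MURMUR_HASH);
    merge(shardA, shardB);
  }
}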

          Lance Norskog added a comment - edited

          At my previous job, we were attempting to add the same document up to 100x per day. We used an MD5 signature for the id and made a bitmap file to pre-check ids before attempting to add them. Because we created a bitmap file with 2^32 bits (512 MB) instead of 2^128, we also had a false positive problem, which we were willing to put up with. (It was OK if we did not add all documents.)

          We also had the advantage that different feeding machines pulled documents from different sources, and so machine A's set of repeated documents was separate from machine B's. Therefore, each could keep its own bitmap file and the files could be OR'd together periodically in background.

          I can't recommend what we did. If you like the Bloom Filter for this problem, that's great.

          This project, FastBits IBIS, claims to be super-smart about compressing bits in a disk archive. It might be a better technology than the Nutch Bloom Filter, but there is no Java version and the C code is under a different license.

          I would counsel against making a central server; Solr technologies should be as distributed and localized (close to the Solr instance) as possible.
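
          A hypothetical sketch of the signature pre-check described above
          (not the actual code); truncating the 128-bit MD5 to a fixed bit
          space is exactly what produces the false positives mentioned. A
          full 2^32-bit map needs 512 MB, so the sketch uses a smaller
          BitSet for illustration.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.BitSet;

/** Pre-checks document ids against a fixed-size bitmap of MD5 signatures. */
public class SignaturePreCheck {

  private static final int NUM_BITS = 1 << 24; // demo size; the real setup used 2^32 bits
  private final BitSet seen = new BitSet(NUM_BITS);

  /** Returns true if the id has (probably) been added before. */
  public boolean probablySeen(String id) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] sig = md5.digest(id.getBytes(StandardCharsets.UTF_8));
    int hash32 = ((sig[0] & 0xFF) << 24) | ((sig[1] & 0xFF) << 16)
        | ((sig[2] & 0xFF) << 8) | (sig[3] & 0xFF);
    int idx = hash32 >>> 8;        // keep 24 bits to fit the demo-sized BitSet
    boolean present = seen.get(idx);
    seen.set(idx);
    return present;
  }
}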

          Jason Rutherglen added a comment -

          Patch file


            People

             • Assignee:
               Unassigned
             • Reporter:
               Jason Rutherglen
             • Votes:
               1
             • Watchers:
               5

                Time Tracking

                 • Original Estimate: 120h
                 • Remaining Estimate: 120h
                 • Time Spent: Not Specified
