Solr / SOLR-1375

BloomFilter on a field



    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 4.9, 6.0
    • Component/s: update
    • Labels:


      • The use case is indexing in Hadoop and checking for duplicates
        against a Solr cluster. Doing that check through the term
        dictionary or a query is too slow and takes longer than the
        indexing itself. When a match is found, the host, segment, and
        term are returned. If the same term is found on multiple
        servers, the distributed process returns multiple results. (I
        just realized we'll also need to add the core name.)
      • When new segments are created and commit is called, a new
        bloom filter is generated from a given field (default: id) by
        iterating over the term dictionary values. There's one bloom
        filter file per segment, managed on each Solr shard. When
        segments are merged away, their corresponding .blm files are
        also removed. In a future version we'll have a central server
        for the bloom filters so we're not abusing the thread pool of
        the Solr proxy and the networking of the Solr cluster (this will
        be done sooner rather than later, after testing this version). I
        held off because the central server requires syncing the Solr
        servers' files (which is like reverse replication).
      • Distributed code is added and seems to work. I extended
        TestDistributedSearch to test over multiple HTTP servers. I
        chose this approach rather than the manual method used by (for
        example) TermVectorComponent.testDistributed because I'm new to
        Solr's distributed search and wanted to learn how it works (the
        stages are confusing). Using this method, I didn't need to set
        up multiple Tomcat servers and manually execute tests.
      • We need more of the bloom filter options passable via
      • I'll add more test cases
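
      For readers unfamiliar with the idea, the per-segment filter and
      the duplicate check described above can be sketched roughly as
      follows. This is only an illustration, not the code in the
      attached patches: the class names SegmentBloomFilter, Match, and
      DuplicateChecker, the FNV-1a hash, and the plain list of terms
      standing in for a segment's term dictionary are all assumptions
      made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical per-segment Bloom filter; NOT the class from the SOLR-1375 patch. */
class SegmentBloomFilter {
    final String host;     // shard host that owns the segment (assumed field)
    final String segment;  // segment name, e.g. "_0" (assumed field)
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SegmentBloomFilter(String host, String segment, int numBits, int numHashes) {
        this.host = host;
        this.segment = segment;
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    /** Builds a filter from term values; the Iterable stands in for the term dictionary. */
    static SegmentBloomFilter build(String host, String segment,
                                    Iterable<String> terms, int numBits, int numHashes) {
        SegmentBloomFilter filter = new SegmentBloomFilter(host, segment, numBits, numHashes);
        for (String term : terms) {
            filter.add(term);
        }
        return filter;
    }

    void add(String term) {
        long h = fnv1a64(term);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // force odd so the probe sequence varies
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits)); // double hashing
        }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String term) {
        long h = fnv1a64(term);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    /** 64-bit FNV-1a over the UTF-8 bytes; chosen only to keep the sketch short. */
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}

/** One possible hit: the host, segment, and term, as the description lists. */
final class Match {
    final String host;
    final String segment;
    final String term;

    Match(String host, String segment, String term) {
        this.host = host;
        this.segment = segment;
        this.term = term;
    }
}

final class DuplicateChecker {
    /** Returns one Match per filter that may contain the term. */
    static List<Match> check(List<SegmentBloomFilter> filters, String term) {
        List<Match> matches = new ArrayList<>();
        for (SegmentBloomFilter filter : filters) {
            if (filter.mightContain(term)) {
                matches.add(new Match(filter.host, filter.segment, term));
            }
        }
        return matches;
    }
}
```

      A bloom filter can report false positives but never false
      negatives, so a miss rules the term out for that segment, while
      a hit may still need to be confirmed against the index. When a
      segment is merged away, its filter would simply be dropped from
      the list passed to check.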


        1. SOLR-1375.patch
          49 kB
          Jason Rutherglen
        2. SOLR-1375.patch
          132 kB
          Jason Rutherglen
        3. SOLR-1375.patch
          133 kB
          Jason Rutherglen
        4. SOLR-1375.patch
          133 kB
          Jason Rutherglen
        5. SOLR-1375.patch
          134 kB
          Jason Rutherglen




              • Assignee:
                jasonrutherglen Jason Rutherglen
              • Votes:
                2
              • Watchers:
                5


                • Created:

                  Time Tracking

                  Original Estimate - 120h
                  Remaining Estimate - 120h
                  Time Spent - Not Specified