  1. Solr
  2. SOLR-1375

BloomFilter on a field

Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 4.9, 6.0
    • Component/s: update
    • Labels: None

    Description

      • The use case is indexing in Hadoop and checking for duplicates
        against a Solr cluster. Doing that check through the term
        dictionary or a query is too slow: it takes longer than the
        indexing itself. When a match is found, the host, segment, and
        term are returned. If the same term is found on multiple
        servers, the distributed process returns multiple results. (I
        just realized we'll also need to add the core name.)
      • When new segments are created and commit is called, a new
        bloom filter is generated from a given field (default: id) by
        iterating over the term dictionary values. There is one bloom
        filter file per segment, managed on each Solr shard. When
        segments are merged away, their corresponding .blm files are
        also removed. In a future version we'll have a central server
        for the bloom filters so we're not abusing the thread pool of
        the Solr proxy and the networking of the Solr cluster (this
        will be done sooner rather than later, after testing this
        version). I held off because the central server requires
        syncing the Solr servers' files (which is like reverse
        replication).
      • Distributed code is added and seems to work. I extended
        TestDistributedSearch to test over multiple HTTP servers. I
        chose this approach rather than the manual method used by (for
        example) TermVectorComponent.testDistributed because I'm new to
        Solr's distributed search and wanted to learn how it works (the
        stages are confusing). Using this method, I didn't need to set
        up multiple Tomcat servers and manually execute tests.
      • We need more of the bloom filter options to be configurable
        via solrconfig.
      • I'll add more test cases.
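
To make the per-segment bookkeeping described above concrete, here is a minimal sketch in plain Java. It is not taken from the patch: the names SegmentBloomSketch, ShardFilters, onCommit, onMergedAway, and lookup are hypothetical, the hash function is an arbitrary FNV-1a-style choice, and filters are kept in memory rather than written to .blm files. It only shows one filter being built per segment from a field's terms, dropped when the segment is merged away, and consulted to return host, segment, and term for each possible match.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: none of these names come from the patch.
class SegmentBloomSketch {

    /** Fixed-size Bloom filter; bit positions come from double hashing. */
    static final class BloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        void add(String term) {
            for (int i = 0; i < numHashes; i++) bits.set(position(term, i));
        }

        /** False means definitely absent; true means possibly present. */
        boolean mightContain(String term) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(position(term, i))) return false;
            }
            return true;
        }

        private int position(String term, int i) {
            byte[] data = term.getBytes(StandardCharsets.UTF_8);
            int h1 = hash(data, 0x811C9DC5);
            int h2 = hash(data, 0x01000193);
            return Math.floorMod(h1 + i * h2, numBits);
        }

        // FNV-1a-style hash with a caller-supplied starting value.
        private static int hash(byte[] data, int seed) {
            int h = seed;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 0x01000193;
            }
            return h;
        }
    }

    /** One possible hit: which host and segment may hold the term. */
    static final class Match {
        final String host;
        final String segment;
        final String term;

        Match(String host, String segment, String term) {
            this.host = host;
            this.segment = segment;
            this.term = term;
        }

        @Override
        public String toString() {
            return host + "/" + segment + ":" + term;
        }
    }

    /** Per-shard registry holding one filter per live segment. */
    static final class ShardFilters {
        private final String host;
        private final int numBits;
        private final int numHashes;
        private final Map<String, BloomFilter> bySegment = new LinkedHashMap<>();

        ShardFilters(String host, int numBits, int numHashes) {
            this.host = host;
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Called after a commit: build a filter from the new segment's terms.
        void onCommit(String segment, Iterable<String> fieldTerms) {
            BloomFilter filter = new BloomFilter(numBits, numHashes);
            for (String term : fieldTerms) filter.add(term);
            bySegment.put(segment, filter);
        }

        // Called after a merge: drop filters of segments that no longer exist.
        void onMergedAway(Iterable<String> segments) {
            for (String segment : segments) bySegment.remove(segment);
        }

        List<Match> lookup(String term) {
            List<Match> matches = new ArrayList<>();
            for (Map.Entry<String, BloomFilter> e : bySegment.entrySet()) {
                if (e.getValue().mightContain(term)) {
                    matches.add(new Match(host, e.getKey(), term));
                }
            }
            return matches;
        }
    }

    public static void main(String[] args) {
        ShardFilters shard = new ShardFilters("host1", 1 << 16, 3);
        shard.onCommit("_0", Arrays.asList("doc1", "doc2"));
        shard.onCommit("_1", Arrays.asList("doc3"));
        System.out.println(shard.lookup("doc3")); // includes host1/_1:doc3
        shard.onMergedAway(Arrays.asList("_0", "_1"));
        System.out.println(shard.lookup("doc3")); // prints [] (no filters left)
    }
}
```

A Bloom filter never reports a term it holds as absent, but it can report an absent term as present, so a caller using a check like this would likely still need to confirm a reported match against the index before treating the document as a duplicate.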

      Attachments

        1. SOLR-1375.patch
          134 kB
          Jason Rutherglen
        2. SOLR-1375.patch
          133 kB
          Jason Rutherglen
        3. SOLR-1375.patch
          133 kB
          Jason Rutherglen
        4. SOLR-1375.patch
          132 kB
          Jason Rutherglen
        5. SOLR-1375.patch
          49 kB
          Jason Rutherglen

            People

              Assignee: Unassigned
              Reporter: Jason Rutherglen (jasonrutherglen)
              Votes: 2
              Watchers: 6

                Time Tracking

                  Original Estimate: 120h
                  Remaining Estimate: 120h
                  Time Spent: Not Specified