Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 4.9, 5.0
    • Component/s: update
    • Labels:
      None

      Description

      • The use case is indexing in Hadoop and checking for duplicates
        against a Solr cluster. Doing that check with the term
        dictionary or a query is too slow and takes longer than the
        indexing itself. When a match is found, the host, segment, and
        term are returned. If the same term is found on multiple
        servers, the distributed process returns multiple results. (I
        just realized we'll also need to add the core name.)
      • When new segments are created and commit is called, a new
        bloom filter is generated from a given field (default: id) by
        iterating over the term dictionary values. There's one bloom
        filter file per segment, managed on each Solr shard. When
        segments are merged away, their corresponding .blm files are
        also removed. In a future version we'll have a central server
        for the bloom filters so we're not abusing the thread pool of
        the Solr proxy and the networking of the Solr cluster (this
        will be done sooner rather than later, after testing this
        version). I held off because the central server requires
        syncing the Solr servers' files (which is like reverse
        replication).
      • Distributed code is added and seems to work. I extended
        TestDistributedSearch to test over multiple HTTP servers. I
        chose this approach rather than the manual method used by (for
        example) TermVectorComponent.testDistributed because I'm new to
        Solr's distributed search and wanted to learn how it works (the
        stages are confusing). Using this method, I didn't need to set
        up multiple Tomcat servers and manually execute tests.
      • We need to make more of the bloom filter options configurable
        via solrconfig.
      • I'll add more test cases.
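
      The per-segment filter lifecycle described above can be sketched
      roughly as follows. This is a minimal illustration and not the code
      in the attached patch: the class name SegmentBloomSketch, its
      methods, the filter size, and the hash scheme are all assumptions
      chosen for the example. It builds one in-memory filter per segment
      on commit, drops the filter when the segment is merged away, and
      reports which segments might contain a given term.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and parameters are illustrative, not from the patch.
class SegmentBloomSketch {

    /** Minimal Bloom filter using double hashing over two FNV-1a style hashes. */
    static final class BloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        private static int fnv1a(byte[] data, int seed) {
            int h = seed;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 16777619;
            }
            return h;
        }

        // i-th bit position for the term: (h1 + i * h2) mod numBits.
        private int index(byte[] data, int i) {
            int h1 = fnv1a(data, 0x811c9dc5);
            int h2 = fnv1a(data, 0x01000193) | 1; // force an odd step
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String term) {
            byte[] data = term.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(data, i));
            }
        }

        /** False means definitely absent; true means possibly present. */
        boolean mightContain(String term) {
            byte[] data = term.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(data, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    // One filter per segment, keyed by segment name (stands in for the .blm files).
    private final Map<String, BloomFilter> filtersBySegment = new LinkedHashMap<>();

    /** On commit: build a filter for a new segment from its term values. */
    void onCommit(String segmentName, Iterable<String> termValues) {
        BloomFilter filter = new BloomFilter(1 << 16, 4); // assumed sizing
        for (String term : termValues) {
            filter.add(term);
        }
        filtersBySegment.put(segmentName, filter);
    }

    /** When a segment is merged away, its filter is removed as well. */
    void onSegmentMergedAway(String segmentName) {
        filtersBySegment.remove(segmentName);
    }

    /** Names of segments whose filter reports a possible match for the term. */
    List<String> segmentsPossiblyContaining(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, BloomFilter> e : filtersBySegment.entrySet()) {
            if (e.getValue().mightContain(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

      A Bloom filter never gives a false negative, so a miss on every
      segment means the id is definitely new. A hit can be a false
      positive, so a caller that needs certainty would still confirm it
      against the term dictionary.
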
      1. SOLR-1375.patch
        134 kB
        Jason Rutherglen
      2. SOLR-1375.patch
        133 kB
        Jason Rutherglen
      3. SOLR-1375.patch
        133 kB
        Jason Rutherglen
      4. SOLR-1375.patch
        132 kB
        Jason Rutherglen
      5. SOLR-1375.patch
        49 kB
        Jason Rutherglen

            People

            • Assignee:
              Unassigned
              Reporter:
              Jason Rutherglen
            • Votes:
              1
              Watchers:
              5

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated:
                120h
                Remaining:
                120h
                Logged:
                Not Specified