Solr / SOLR-1537

Dedupe Sharded Search Results by Shard Order or Score

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4, 1.5
    • Fix Version/s: 4.9, 5.0
    • Component/s: search
    • Labels:
      None
    • Environment:

      All

      Description

      Allows sharded search results to be deduplicated by ID, based on either the order of the shards in the shards param or on score, so that the returned result is deterministic. When deduplicating by shard order, shards that appear earlier in the shards param take precedence over shards that appear later. When deduplicating by score, higher scores beat lower scores. Only one document per ID is kept, because Solr currently permits only a single result per ID to be returned.
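
      The two precedence rules described above can be sketched as follows. This is a minimal illustration, not the code in the attached patches: the class ShardDedupeSketch, the Hit type, and the Mode enum are hypothetical names, and the choice to break score ties by shard order is an assumption made here to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShardDedupeSketch {
    /** One hit as returned by a shard; the class and field names are illustrative. */
    static final class Hit {
        final String id;
        final String shard;
        final double score;
        Hit(String id, String shard, double score) {
            this.id = id;
            this.shard = shard;
            this.score = score;
        }
    }

    enum Mode { SHARD_ORDER, SCORE }

    /**
     * Keeps exactly one hit per id. In SHARD_ORDER mode the hit from the shard
     * listed earliest in shardsParam wins. In SCORE mode the highest score wins,
     * with shard order as a tie-breaker (an assumption of this sketch).
     * The surviving hits are returned sorted by score, highest first.
     */
    static List<Hit> dedupe(List<Hit> hits, List<String> shardsParam, Mode mode) {
        // Position of each shard in the shards param; lower means higher precedence.
        final Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < shardsParam.size(); i++) {
            rank.put(shardsParam.get(i), i);
        }

        final Comparator<Hit> byShard =
            Comparator.comparingInt((Hit h) -> rank.getOrDefault(h.shard, Integer.MAX_VALUE));
        final Comparator<Hit> byScoreDesc =
            Comparator.comparingDouble((Hit h) -> h.score).reversed();
        final Comparator<Hit> better =
            mode == Mode.SHARD_ORDER ? byShard : byScoreDesc.thenComparing(byShard);

        // For each id, keep whichever hit the active rule prefers.
        Map<String, Hit> winners = new LinkedHashMap<>();
        for (Hit h : hits) {
            winners.merge(h.id, h, (a, b) -> better.compare(a, b) <= 0 ? a : b);
        }

        List<Hit> out = new ArrayList<>(winners.values());
        out.sort(byScoreDesc.thenComparing(byShard));
        return out;
    }
}
```

      With shards listed as shardA,shardB and a document doc1 returned by both (score 0.5 from shardA, 0.9 from shardB), SHARD_ORDER keeps the shardA copy and SCORE keeps the shardB copy. Neither outcome depends on which shard responded first.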

      1. solr-dedupe-20091106-3.patch
        28 kB
        Dennis Kubes
      2. solr-dedupe-20091031-2.patch
        23 kB
        Dennis Kubes
      3. solr-dedupe-20091031.patch
        22 kB
        Dennis Kubes
      4. SOLR-1537-20091126-4.patch
        43 kB
        Dennis Kubes

          Activity

          Dennis Kubes added a comment -

          Basic patch. No unit tests. Gives dedupe functionality for shards based on either shard order in the shard param or by score.

          Dennis Kubes added a comment -

          Updated patch. Had to replace the TreeSet used for on-the-fly document queuing with a two-pass approach using a HashSet and a Java 5 PriorityQueue. A TreeSet discards elements that compare as equal, so this change was needed to allow comparably equal documents (i.e. documents with the same score).
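
          The difference between the two data structures can be sketched like this. It is an illustration under assumptions, not the patch itself: EqualScoreQueueSketch and Doc are hypothetical names, and the split into "dedupe ids with a HashSet, then rank with a PriorityQueue" is one plausible reading of the two passes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.TreeSet;

class EqualScoreQueueSketch {
    /** Minimal document stand-in; names are illustrative. */
    static final class Doc {
        final String id;
        final double score;
        Doc(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Orders by score only, highest first. Two docs with the same score compare as 0.
    static final Comparator<Doc> BY_SCORE_DESC =
        (a, b) -> Double.compare(b.score, a.score);

    /** Single pass with a TreeSet: a doc whose score ties an existing entry is silently dropped. */
    static int treeSetSize(List<Doc> docs) {
        TreeSet<Doc> set = new TreeSet<>(BY_SCORE_DESC);
        set.addAll(docs);
        return set.size();
    }

    /** A HashSet removes duplicate ids, then a PriorityQueue ranks what is left. */
    static List<String> twoPass(List<Doc> docs) {
        Set<String> seen = new HashSet<>();
        PriorityQueue<Doc> queue = new PriorityQueue<>(11, BY_SCORE_DESC);
        for (Doc d : docs) {
            if (seen.add(d.id)) {   // true only for the first occurrence of an id
                queue.add(d);       // a PriorityQueue keeps docs with equal scores
            }
        }
        List<String> ranked = new ArrayList<>();
        while (!queue.isEmpty()) {
            ranked.add(queue.poll().id);
        }
        return ranked;
    }
}
```

          Given docs a (1.0), b (1.0), a duplicate of a (1.0), and c (0.5), the TreeSet ends up with two entries because b ties with a and is discarded, while the two-pass version returns three ids with c last. The relative order of a and b is unspecified, since PriorityQueue does not define an order among equal elements.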

          Otis Gospodnetic added a comment -

          The "ID" here is the uniqueKey? I.e., the use case is removing dupes when the same document is indexed in multiple shards and more than one shard returns that document in the result set?

          Dennis Kubes added a comment -

          That is correct. A dupe occurs when more than one shard returns a value for the same uniqueKey. Dupes are removed by uniqueKey, deterministically, by either shard order or highest score. Previously there was no way to determine which dupe would show up, because it depended on whichever shard responded first to the query broadcast to multiple shards. In other words, the fastest-responding shard supplied the document for a given uniqueKey, and the rest with that uniqueKey were ignored. Which shard is fastest, though, can change between query requests.

          Dennis Kubes added a comment -

          Fixes a small issue with the numFound count being doubled.

          Dennis Kubes added a comment -

          The newest patch supersedes SOLR-1143, fixing some bugs, updating unit tests, and adding the ability to return partial results even when server names are misspelled, as opposed to only on simple connection errors. It also adds headers showing the number of failed shards and the names of those shards.

          Dennis Kubes added a comment -

          Final patch. This incorporates an updated version of SOLR-1143, allowing the return of partial search results. It fixes bugs in the number of results returned, sorting order, and errors on edge conditions, among others. It also supersedes SOLR-1143, bringing all unit tests up to date and adding the ability to return partial results when server names are misspelled or when there are errors other than simple connection errors. Headers have been added to show the number of failed shards and the names of those shards. Unit tests have been added to demonstrate dedupe of search results by shard order. This patch passes all current unit tests.
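
          The partial-results behaviour described above can be sketched roughly as follows. Everything here is a hypothetical stand-in: PartialResultsSketch, Merged, and the fetch function are not Solr APIs, and the actual header names used by the patch are not shown in this issue, so plain fields are used instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PartialResultsSketch {
    /** Merged response plus bookkeeping about shards that could not be queried. */
    static final class Merged {
        final List<String> ids = new ArrayList<>();
        final List<String> failedShards = new ArrayList<>();
        int failedCount() {
            return failedShards.size();
        }
    }

    /**
     * Queries every shard in order. A shard whose request fails for any reason
     * (connection refused, unknown host from a misspelled name, and so on) is
     * recorded by name and skipped, so the remaining shards still contribute.
     */
    static Merged query(List<String> shards, Function<String, List<String>> fetch) {
        Merged merged = new Merged();
        for (String shard : shards) {
            try {
                merged.ids.addAll(fetch.apply(shard));
            } catch (RuntimeException e) {
                merged.failedShards.add(shard);   // reported back instead of failing the request
            }
        }
        return merged;
    }
}
```

          A caller would pass the shard list and a function that performs the per-shard request, then read failedCount() and failedShards to populate whatever response headers the server exposes.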

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          email notification suppressed to prevent mass-spam
          pseudo-unique token identifying these issues: hoss20120321nofix36

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.


            People

            • Assignee:
              Unassigned
              Reporter:
              Dennis Kubes
            • Votes:
              4
              Watchers:
              7
