Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7686

NRT suggester should have option to filter out duplicates

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Some of the other suggesters have this ability, and it's quite simple to add it to the NRT suggester as long as the thing we are filtering on is the suggest key itself, not e.g. another stored field from the document.

      1. LUCENE-7686.patch
        66 kB
        Michael McCandless
      2. LUCENE-7686.patch
        30 kB
        Michael McCandless
      3. LUCENE-7686.patch
        8 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Here's a patch; the duplicate handling needs to be done in the collector (and not the FST traversal) because dups could be across segments.

        Show
        mikemccand Michael McCandless added a comment - Here's a patch; the duplicate handling needs to be done in the collector (and not the FST traversal) because dups could be across segments.
        Hide
        mikemccand Michael McCandless added a comment -

        Hmm the patch is not quite right; I need to fix the TopNSearcher.acceptResult to return false in the duplicate case ... I'll iterate some more.

        Show
        mikemccand Michael McCandless added a comment - Hmm the patch is not quite right; I need to fix the TopNSearcher.acceptResult to return false in the duplicate case ... I'll iterate some more.
        Hide
        mikemccand Michael McCandless added a comment -

        OK actually the patch is OK; this suggester effectively ignores that returned boolean since we don't use the returned result from TopNSearcher.search...

        Show
        mikemccand Michael McCandless added a comment - OK actually the patch is OK; this suggester effectively ignores that returned boolean since we don't use the returned result from TopNSearcher.search ...
        Hide
        mikemccand Michael McCandless added a comment -

        Another iteration ... it was trickier than I first realized because I have to handle the case where another (later) segment has the same suggestion with a better score, and replace the previous one in the priority queue.

        So I broke out the dedup handling to a separate collector, DeduplicatingTopSuggestDocsCollector.

        And I added a new randomized test case in addition to the dedicated specific test case.

        I think it's ready.

        Show
        mikemccand Michael McCandless added a comment - Another iteration ... it was trickier than I first realized because I have to handle the case where another (later) segment has the same suggestion with a better score, and replace the previous one in the priority queue. So I broke out the dedup handling to a separate collector, DeduplicatingTopSuggestDocsCollector . And I added a new randomized test case in addition to the dedicated specific test case. I think it's ready.
        Hide
        thetaphi Uwe Schindler added a comment -

        Thanks Mike,
        I was busy yesterday because of updating Elasticsearch, so I was not able to look into it. Thanks for doing this.
        I will reply with some related suggestions on the Elastic issue.
        Uwe

        Show
        thetaphi Uwe Schindler added a comment - Thanks Mike, I was busy yesterday because of updating Elasticsearch, so I was not able to look into it. Thanks for doing this. I will reply with some related suggestions on the Elastic issue. Uwe
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Uwe Schindler ... here's the ES issue that led to this: https://github.com/elastic/elasticsearch/issues/22912

        Show
        mikemccand Michael McCandless added a comment - Thanks Uwe Schindler ... here's the ES issue that led to this: https://github.com/elastic/elasticsearch/issues/22912
        Hide
        mikemccand Michael McCandless added a comment -

        Uwe Schindler had a good suggestion on the ES issue, to use the FST earlier to dedup, instead of doing it at collection time ... I'll explore this. It should make dedup very fast.

        Show
        mikemccand Michael McCandless added a comment - Uwe Schindler had a good suggestion on the ES issue, to use the FST earlier to dedup, instead of doing it at collection time ... I'll explore this. It should make dedup very fast.
        Hide
        mikemccand Michael McCandless added a comment -

        Another iteration, this time filtering dups earlier in the top N
        search. I added a new method, acceptPartialPath to the
        Util.TopNSearcher class so that subclasses are able to prune a
        still in-progress path, not just a completed path. This should be
        quite efficient even when the number of duplicates is very high,
        because the top N search will quickly push to the one not deleted, not
        filtered out, highest scoring document with the suggestion, record
        that surface form, and then prune subsequent intermediate paths
        sharing that same surface form.

        I also added another "extreme" dedup test case to test the logic that
        computes the necessary queue size, and it's passing, and the new
        TestSuggestField.testRandom seems to survive moderate beasting...

        I think it's ready.

        Show
        mikemccand Michael McCandless added a comment - Another iteration, this time filtering dups earlier in the top N search. I added a new method, acceptPartialPath to the Util.TopNSearcher class so that subclasses are able to prune a still in-progress path, not just a completed path. This should be quite efficient even when the number of duplicates is very high, because the top N search will quickly push to the one not deleted, not filtered out, highest scoring document with the suggestion, record that surface form, and then prune subsequent intermediate paths sharing that same surface form. I also added another "extreme" dedup test case to test the logic that computes the necessary queue size, and it's passing, and the new TestSuggestField.testRandom seems to survive moderate beasting... I think it's ready.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 4e2cf61ac76db33f35d3aceacaf1563a9bd5edb2 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4e2cf61 ]

        LUCENE-7686: add efficient de-duping to the NRT document suggester

        Show
        jira-bot ASF subversion and git services added a comment - Commit 4e2cf61ac76db33f35d3aceacaf1563a9bd5edb2 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4e2cf61 ] LUCENE-7686 : add efficient de-duping to the NRT document suggester
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 0d5a61b3df04593691796867ae3b32d05e66a0c0 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0d5a61b ]

        LUCENE-7686: add efficient de-duping to the NRT document suggester

        Show
        jira-bot ASF subversion and git services added a comment - Commit 0d5a61b3df04593691796867ae3b32d05e66a0c0 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0d5a61b ] LUCENE-7686 : add efficient de-duping to the NRT document suggester
        Hide
        thetaphi Uwe Schindler added a comment -

        Thanks Mike! Was not able to test it, but looks good to me!

        Show
        thetaphi Uwe Schindler added a comment - Thanks Mike! Was not able to test it, but looks good to me!

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development