Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-456

Duplicate hits and missing hits in sorted search

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      JDK 1.4.2_06, probably OS independant, testet on Solaris 8 and Win2000

      Description

      If using a searcher that subclasses from IndexSearcher I get different result sets (besides the ordering of course). The problem only occurrs if the searcher is wrapped by (Parallel)MultiSearcher and the index is not too small. The number of hits returned by un unsorted and a sorted search are identical but the hits are referencing different documents. A closer look at the result sets revealed that the sorted search returns duplicate hits.

      I created test cases for Lucene 1.4.3 as well as for the head release. The problem showed up for both, the number of duplicates beeing bigger for the head realease. The test cases are written for package org.apache.lucene.search. There are messages describing the problem written to the console. In order to see all those hints the asserts are commented out. So dont't be confused if junit reports no errors. (Sorry, beeing a novice user of the bug tracker I don't see any means to attach the test cases on this screen. Let's see.)

      1. FieldDocSortedHitQueue_dups.txt
        0.5 kB
        Yonik Seeley
      2. TestCustomSearcherSort_1_4_3.java
        10 kB
        Martin Seitz
      3. TestCustomSearcherSort_HEAD.java
        10 kB
        Martin Seitz

        Activity

        Hide
        mseitz Martin Seitz added a comment -

        The test case for Lucene 1.4.3.

        Show
        mseitz Martin Seitz added a comment - The test case for Lucene 1.4.3.
        Hide
        mseitz Martin Seitz added a comment -

        Test case for HEAD revision form oct 12, 2005.

        Show
        mseitz Martin Seitz added a comment - Test case for HEAD revision form oct 12, 2005.
        Hide
        lvl Luc Vanlerberghe added a comment -

        This could be related to issue 453 I submitted a few days ago.

        (Parallel-)MultiSearcher uses a FieldDocSortedHitQueue to merge the results from the underlying IndexSearchers into the final result list (even if there's only one).

        There's a bug if two documents are compared that don't have the field that is being compared. If one document doesn't have the field, then it should come first, but if both don't have it, they should be considered equal and the next SortField tried.

        In the current implementation (1.4.3 and head) when both fields are null, one of the two is always chosen as being 'lessThan' the other. Since this is not consistent (the value of the lessThan depends of the order of the parameters in that case) it changes the sort order (as I observed) and can probably 'confuse' the queue implementation so you get dropped and double hits.

        Could you try to apply the patch I submitted for FieldDocSortedHitQueue and see if that helps in your case too?

        Show
        lvl Luc Vanlerberghe added a comment - This could be related to issue 453 I submitted a few days ago. (Parallel-)MultiSearcher uses a FieldDocSortedHitQueue to merge the results from the underlying IndexSearchers into the final result list (even if there's only one). There's a bug if two documents are compared that don't have the field that is being compared. If one document doesn't have the field, then it should come first, but if both don't have it, they should be considered equal and the next SortField tried. In the current implementation (1.4.3 and head) when both fields are null, one of the two is always chosen as being 'lessThan' the other. Since this is not consistent (the value of the lessThan depends of the order of the parameters in that case) it changes the sort order (as I observed) and can probably 'confuse' the queue implementation so you get dropped and double hits. Could you try to apply the patch I submitted for FieldDocSortedHitQueue and see if that helps in your case too?
        Hide
        mseitz Martin Seitz added a comment -

        The patch doesn't fix the problem. On the contrary the number of reported duplicates increases. However, the fact that the patch changes the behaviour at all may imply that the problem does relate to the FieldSortedHitQueue.
        As an additional test I removed the if statement in my testcase that causes field "publicationDate_" staying empty for some documents. (So that all documents have that field.) After the change there are still duplicates in the result set, however there are less. This implies that the problem has not directly to do with sorting of empty fields.

        Show
        mseitz Martin Seitz added a comment - The patch doesn't fix the problem. On the contrary the number of reported duplicates increases. However, the fact that the patch changes the behaviour at all may imply that the problem does relate to the FieldSortedHitQueue. As an additional test I removed the if statement in my testcase that causes field "publicationDate_" staying empty for some documents. (So that all documents have that field.) After the change there are still duplicates in the result set, however there are less. This implies that the problem has not directly to do with sorting of empty fields.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        The problem is when docs are equal, the order is unspecified. That in conjunction with the fact that Hits can do multiple searches under the covers, leads to your document duplication.

        Everything works fine after I apply the patch I'm attaching here, in addition to the patch here:
        http://issues.apache.org/jira/browse/LUCENE-453
        And for good measure, the patch here:
        http://issues.apache.org/jira/browse/LUCENE-374

        Show
        yseeley@gmail.com Yonik Seeley added a comment - The problem is when docs are equal, the order is unspecified. That in conjunction with the fact that Hits can do multiple searches under the covers, leads to your document duplication. Everything works fine after I apply the patch I'm attaching here, in addition to the patch here: http://issues.apache.org/jira/browse/LUCENE-453 And for good measure, the patch here: http://issues.apache.org/jira/browse/LUCENE-374
        Hide
        mseitz Martin Seitz added a comment -

        Yep! That did it. I applied the three patches (I had to modify them slightly for release 1.4.3) and it works - the test cases as well as my application.

        Show
        mseitz Martin Seitz added a comment - Yep! That did it. I applied the three patches (I had to modify them slightly for release 1.4.3) and it works - the test cases as well as my application.
        Hide
        lvl Luc Vanlerberghe added a comment -

        I applied the three patches as well and I also don't see any anomalies any more.

        Perhaps the fact that there is a 'hidden' ultimate sorting key (based on the internal document number) for equal documents should be mentioned somewhere in the
        the documentation. It's the most logical solution to make the sort stable without sacrificing speed.

        If I understand correctly, that already existed for standalone IndexSearchers, but was 'forgotten' for (Parallel-)MultiSearchers.

        Show
        lvl Luc Vanlerberghe added a comment - I applied the three patches as well and I also don't see any anomalies any more. Perhaps the fact that there is a 'hidden' ultimate sorting key (based on the internal document number) for equal documents should be mentioned somewhere in the the documentation. It's the most logical solution to make the sort stable without sacrificing speed. If I understand correctly, that already existed for standalone IndexSearchers, but was 'forgotten' for (Parallel-)MultiSearchers.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        patch applied to the current version.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - patch applied to the current version.

          People

          • Assignee:
            yseeley@gmail.com Yonik Seeley
            Reporter:
            mseitz Martin Seitz
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development