Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-526

FieldSortedHitQueue - subsequent String sorts with different locales sort identically

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 2.0.0
    • Component/s: core/search
    • Labels:
      None

      Description

      From my own post to the java-user list. I have looked into this further and am sure it's a bug.

      It seems to me that there's a possible bug in FieldSortedHitQueue, specifically in getCachedComparator(). This is showing up on our 1.4.3 install, but it seems from source code inspection that if it's a bug, it's in 1.9.1 also.

      The issue shows up when you need to sort results from a given IndexReader multiple times, using different locales. On line 180 (all line numbers from the 1.9.1 code), we have this:

      ScoreDocComparator comparator = lookup (reader, fieldname, type, factory);

      Then, if no comparator is found in the cache, a new one is created (line 193) and then stored in the cache (line 202). HOWEVER, both the cache lookup() and store() do NOT take into account locale; if we, on the same index reader, try to do one search sorted by Locale.FRENCH and one by Locale.ITALIAN, the first one will result in a cache miss, a new French comparator will be created, and stored in the cache. Second time through, lookup() finds the cached French comparator – even though this time, the locale parameter to getCachedComparator() is an Italian locale. Therefore, we don't create a new comparator and we use the wrong one to sort the results.

      It looks to me (unless I'm mistaken) that the FieldCacheImpl.Entry class should have an additional property, .locale, to ensure that different locales get different comparators.

      Patch (well, most of one) to follow immediately.

      1. LUCENE-526.txt
        12 kB
        Paul Cowan
      2. LUCENE-526-b.txt
        12 kB
        Paul Cowan
      3. LUCENE-526-c.txt
        5 kB
        Paul Cowan
      4. LUCENE-526-d.txt
        1 kB
        Paul Cowan

        Activity

        Hide
        pcowan Paul Cowan added a comment -

        Attached is a patch that provides an additional test case to TestSort.java for this issue (currently fails on latest trunk code) and patches to FieldCacheImpl.java and FieldSortedHitQueue.java which allow the test case to pass.

        This is not an idea solution as it lumbers FieldCacheImpl.Entry with a locale field which is only used by the cache in FieldSortedHitQueue, and not the main cache from FieldCacheImpl. This means that ONLY the FieldSortedHitQueue is locale-aware; the FieldCache itself isn't (there are no locales used in the interface). I thought I would provide this "prototype" patch, though, as an example. I am open to suggestions about how best to implement this. Options include

        • subclass FieldCacheImpl.Entry into a locale-aware version, which FieldSortedHitQueue uses
        • change the FieldCache interface to take in a locale for the various getXxxx methods (or possibly only the String ones?)
        • something else I haven't thought of...?

        I look forward to your thoughts.

        Show
        pcowan Paul Cowan added a comment - Attached is a patch that provides an additional test case to TestSort.java for this issue (currently fails on latest trunk code) and patches to FieldCacheImpl.java and FieldSortedHitQueue.java which allow the test case to pass. This is not an idea solution as it lumbers FieldCacheImpl.Entry with a locale field which is only used by the cache in FieldSortedHitQueue, and not the main cache from FieldCacheImpl. This means that ONLY the FieldSortedHitQueue is locale-aware; the FieldCache itself isn't (there are no locales used in the interface). I thought I would provide this "prototype" patch, though, as an example. I am open to suggestions about how best to implement this. Options include subclass FieldCacheImpl.Entry into a locale-aware version, which FieldSortedHitQueue uses change the FieldCache interface to take in a locale for the various getXxxx methods (or possibly only the String ones?) something else I haven't thought of...? I look forward to your thoughts.
        Hide
        pcowan Paul Cowan added a comment -

        New version of the patch with unicode escapes instead of hard-coded non-ASCII characters (which are bound to cause problems later) in the test case.

        Show
        pcowan Paul Cowan added a comment - New version of the patch with unicode escapes instead of hard-coded non-ASCII characters (which are bound to cause problems later) in the test case.
        Hide
        pcowan Paul Cowan added a comment -

        It seems that there are more issues with locale-based sorting than just the above.

        Attached is LUCENE-526-c.txt, a new version of the TestSort.java patch which adds another test case. Even after LUCENE-526-b is applied, and hence the first new test passes, the second fails.

        The second test takes the existing IndexSearcher, wraps it up as the only Searcher in a MultiSearcher, and then searches the MultiSearcher. It seems that the results come out of the IndexSearcher in the right order, but when MultiSearcher merges the results from all of its Searchers (in this case, just one) the local-specific sorting isn't done correctly.

        I will investigate further today, with a patch hopefully to follow.

        Show
        pcowan Paul Cowan added a comment - It seems that there are more issues with locale-based sorting than just the above. Attached is LUCENE-526 -c.txt, a new version of the TestSort.java patch which adds another test case. Even after LUCENE-526 -b is applied, and hence the first new test passes, the second fails. The second test takes the existing IndexSearcher, wraps it up as the only Searcher in a MultiSearcher, and then searches the MultiSearcher. It seems that the results come out of the IndexSearcher in the right order, but when MultiSearcher merges the results from all of its Searchers (in this case, just one) the local-specific sorting isn't done correctly. I will investigate further today, with a patch hopefully to follow.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Thanks for looking into this Paul!
        A quick pointer... a Multisearcher uses different sorting code than an IndexSearcher. IIRC, MultiSearcher uses FieldDocSortedHitQueue and an IndexSearcher uses FieldSortedHitQueue.

        It wouldn't be the first time that a bug was fixed in one but forgotten about in the other.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Thanks for looking into this Paul! A quick pointer... a Multisearcher uses different sorting code than an IndexSearcher. IIRC, MultiSearcher uses FieldDocSortedHitQueue and an IndexSearcher uses FieldSortedHitQueue. It wouldn't be the first time that a bug was fixed in one but forgotten about in the other.
        Hide
        pcowan Paul Cowan added a comment -

        Another patch, to fix the last-described issue.

        Turns out the problem was the FieldSortedHitQueue constructor; when building the fields[] array, it was ignoring the locale if there was one. The comparators[] array is correct, so FieldSortedHitQueue sorts fine, but the topdocs that come out of the search have no locale in the fields[]; therefore, the in-order insertion in MultiSearcher screws up the order as it now no longer knows what locale to use.

        Test case from LUCENE-526-c.txt now passes.

        Show
        pcowan Paul Cowan added a comment - Another patch, to fix the last-described issue. Turns out the problem was the FieldSortedHitQueue constructor; when building the fields[] array, it was ignoring the locale if there was one. The comparators[] array is correct, so FieldSortedHitQueue sorts fine, but the topdocs that come out of the search have no locale in the fields[]; therefore, the in-order insertion in MultiSearcher screws up the order as it now no longer knows what locale to use. Test case from LUCENE-526 -c.txt now passes.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        I've reviewed and committed these patches.
        Thanks again Paul!

        Show
        yseeley@gmail.com Yonik Seeley added a comment - I've reviewed and committed these patches. Thanks again Paul!

          People

          • Assignee:
            yseeley@gmail.com Yonik Seeley
            Reporter:
            pcowan Paul Cowan
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development