Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7052

BytesRefHash.sort should always sort in unicode code point order

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today BytesRefHash.sort takes a custom Comparator but we always pass it BytesRef.getUTF8SortedAsUnicodeComparator().

      1. LUCENE-7052.patch
        10 kB
        Michael McCandless
      2. LUCENE-7052-cleanup1.patch
        14 kB
        Uwe Schindler
      3. LUCENE-7052-cleanup1.patch
        6 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          mikemccand Michael McCandless added a comment -

          Simple patch.

          Show
          mikemccand Michael McCandless added a comment - Simple patch.
          Hide
          rcmuir Robert Muir added a comment -

          +1

          Show
          rcmuir Robert Muir added a comment - +1
          Hide
          dweiss Dawid Weiss added a comment -

          Are you for simplifying the code? Because I doubt it'll be any speed improvement.

          Show
          dweiss Dawid Weiss added a comment - Are you for simplifying the code? Because I doubt it'll be any speed improvement.
          Hide
          jpountz Adrien Grand added a comment -

          +1 for the simplification

          Show
          jpountz Adrien Grand added a comment - +1 for the simplification
          Hide
          mikemccand Michael McCandless added a comment -

          Yeah, just shrinking the API.

          Show
          mikemccand Michael McCandless added a comment - Yeah, just shrinking the API.
          Hide
          steve_rowe Steve Rowe added a comment -

          Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()? The test passes for me with this patch:

          diff --git a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
          index 50d921b..c3a58ff 100644
          --- a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
          +++ b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
          @@ -168,15 +168,6 @@ public class TestBytesRefHash extends LuceneTestCase {
               }
             }
           
          -  private static int[] codePoints(String input) {
          -    int length = Character.codePointCount(input, 0, input.length());
          -    int word[] = new int[length];
          -    for (int i = 0, j = 0, cp = 0; i < input.length(); i += Character.charCount(cp)) {
          -      word[j++] = cp = input.codePointAt(i);
          -    }
          -    return word;
          -  }
          -
             /**
              * Test method for
              * {@link org.apache.lucene.util.BytesRefHash#sort()}.
          @@ -191,8 +182,8 @@ public class TestBytesRefHash extends LuceneTestCase {
                 SortedSet<String> strings = new TreeSet<>(new Comparator<String>() {
                     @Override
                     public int compare(String a, String b) {
          -            int[] aCodePoints = codePoints(a);
          -            int[] bCodePoints = codePoints(b);
          +            int[] aCodePoints = a.codePoints().toArray();
          +            int[] bCodePoints = b.codePoints().toArray();
                       for(int i=0;i<Math.min(aCodePoints.length, bCodePoints.length);i++) {
                         if (aCodePoints[i] < bCodePoints[i]) {
                           return -1;
          
          Show
          steve_rowe Steve Rowe added a comment - Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()? The test passes for me with this patch: diff --git a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java index 50d921b..c3a58ff 100644 --- a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java +++ b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java @@ -168,15 +168,6 @@ public class TestBytesRefHash extends LuceneTestCase { } } - private static int[] codePoints(String input) { - int length = Character.codePointCount(input, 0, input.length()); - int word[] = new int[length]; - for (int i = 0, j = 0, cp = 0; i < input.length(); i += Character.charCount(cp)) { - word[j++] = cp = input.codePointAt(i); - } - return word; - } - /** * Test method for * {@link org.apache.lucene.util.BytesRefHash#sort()}. @@ -191,8 +182,8 @@ public class TestBytesRefHash extends LuceneTestCase { SortedSet<String> strings = new TreeSet<>(new Comparator<String>() { @Override public int compare(String a, String b) { - int[] aCodePoints = codePoints(a); - int[] bCodePoints = codePoints(b); + int[] aCodePoints = a.codePoints().toArray(); + int[] bCodePoints = b.codePoints().toArray(); for(int i=0;i<Math.min(aCodePoints.length, bCodePoints.length);i++) { if (aCodePoints[i] < bCodePoints[i]) { return -1;
          Hide
          dweiss Dawid Weiss added a comment -

          Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much.

          Show
          dweiss Dawid Weiss added a comment - Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much.
          Hide
          thetaphi Uwe Schindler added a comment - - edited

          Hi Mike,
          I know originally we added the different comparators to be able to allow the index term dict to be sorted in different order. This never prooved to be useful, as many Lucene queries rely on the default order. The only codec that used another byte order internally was the Lucene 3 one (but it used the unicode spaghetti algorithm to reorder its term enums at runtime). As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting?

          I checked the code: utf8SortedAsUTF16SortOrder is only used in TSTLookup nowhere else anymore (except some test that check alternative sorts - those can be removed).

          As a first step I changed the BytesRef code to no longer use inner classes and instead use a lambda to define the comparators. But I'd suggest to remove at least the UTF-16 one completely and move it as private impl detail to TSTLookup (as only used there).

          FYI: The lambda has no speed impact because it is called only once and internally compiles to a class file that implements Comparator. It just looks nicer than the horrible comparator classes

          Show
          thetaphi Uwe Schindler added a comment - - edited Hi Mike, I know originally we added the different comparators to be able to allow the index term dict to be sorted in different order. This never prooved to be useful, as many Lucene queries rely on the default order. The only codec that used another byte order internally was the Lucene 3 one (but it used the unicode spaghetti algorithm to reorder its term enums at runtime). As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting? I checked the code: utf8SortedAsUTF16SortOrder is only used in TSTLookup nowhere else anymore (except some test that check alternative sorts - those can be removed). As a first step I changed the BytesRef code to no longer use inner classes and instead use a lambda to define the comparators. But I'd suggest to remove at least the UTF-16 one completely and move it as private impl detail to TSTLookup (as only used there). FYI: The lambda has no speed impact because it is called only once and internally compiles to a class file that implements Comparator. It just looks nicer than the horrible comparator classes
          Hide
          thetaphi Uwe Schindler added a comment -

          Oh this was already committed. Maybe open another issue to remove the obsolete comparator and convert to lambda.

          Show
          thetaphi Uwe Schindler added a comment - Oh this was already committed. Maybe open another issue to remove the obsolete comparator and convert to lambda.
          Hide
          mikemccand Michael McCandless added a comment -

          Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()?

          Oh, because I didn't even know about CharSequence.codePoints!

          +1 to your patch, thanks.

          Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much.

          It's definitely inefficient (converting to a sortable key on every comparison), but it keeps the code simple, which I think is usually the right tradeoff for a test case?

          As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting?

          +1, that sounds awesome!

          Show
          mikemccand Michael McCandless added a comment - Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()? Oh, because I didn't even know about CharSequence.codePoints ! +1 to your patch, thanks. Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much. It's definitely inefficient (converting to a sortable key on every comparison), but it keeps the code simple, which I think is usually the right tradeoff for a test case? As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting? +1, that sounds awesome!
          Hide
          thetaphi Uwe Schindler added a comment -

          +1 to Steve Rowe change. We are on Java 8, so use Java 8 methods.

          Show
          thetaphi Uwe Schindler added a comment - +1 to Steve Rowe change. We are on Java 8, so use Java 8 methods.
          Hide
          thetaphi Uwe Schindler added a comment -

          Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready?

          Show
          thetaphi Uwe Schindler added a comment - Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready?
          Hide
          mikemccand Michael McCandless added a comment -

          TestUnicodeUtil.testUTF8toUTF32 (and I'm sure other places) can also cutover to CharSequence.codePoints.

          Show
          mikemccand Michael McCandless added a comment - TestUnicodeUtil.testUTF8toUTF32 (and I'm sure other places) can also cutover to CharSequence.codePoints .
          Hide
          mikemccand Michael McCandless added a comment -

          Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready?

          Yes please!!

          Show
          mikemccand Michael McCandless added a comment - Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready? Yes please!!
          Hide
          thetaphi Uwe Schindler added a comment -

          Hi Mike, here is a patch, moving the UTF-16 comparator away and clean up code.

          I also changed the TreeSet comparator to a lambda using CharSequence#codePoints() because I hit the same in another test

          Show
          thetaphi Uwe Schindler added a comment - Hi Mike, here is a patch, moving the UTF-16 comparator away and clean up code. I also changed the TreeSet comparator to a lambda using CharSequence#codePoints() because I hit the same in another test
          Hide
          thetaphi Uwe Schindler added a comment -

          I wil open now other issue with the attached patch. As the Steve Rowe suggestion is implemented there, we can leave THIS issue closed.

          Show
          thetaphi Uwe Schindler added a comment - I wil open now other issue with the attached patch. As the Steve Rowe suggestion is implemented there, we can leave THIS issue closed.
          Hide
          thetaphi Uwe Schindler added a comment -

          New issue: LUCENE-7053

          Show
          thetaphi Uwe Schindler added a comment - New issue: LUCENE-7053
          Hide
          hossman Hoss Man added a comment -

          I'm confused ... this issue is marked resolved, and Uwe said "Oh this was already committed." but no commits are linked to it, not does it have any subtasks.

          can someone clarify when/how this was resolved?

          Show
          hossman Hoss Man added a comment - I'm confused ... this issue is marked resolved, and Uwe said "Oh this was already committed." but no commits are linked to it, not does it have any subtasks. can someone clarify when/how this was resolved?
          Hide
          steve_rowe Steve Rowe added a comment -

          AFAICT Mike forgot to include the JIRA in his log message for this commit (from the email sent to commits@l.a.o):

          Repository: lucene-solr
          Updated Branches:
           refs/heads/master 70440bbbd -> 126ac9a5f
          
          BytesRefHash.sort always sorts in unicode order
          
          Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
          Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/126ac9a5
          Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/126ac9a5
          Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/126ac9a5
          
          Branch: refs/heads/master
          Commit: 126ac9a5fe00fbbc6870ef25ae3fc6af6cd7c557
          Parents: 70440bb
          Author: Mike McCandless <mikemccand@apache.org>
          Authored: Sat Feb 27 10:26:20 2016 -0500
          Committer: Mike McCandless <mikemccand@apache.org>
          Committed: Sat Feb 27 10:26:20 2016 -0500
          
          Show
          steve_rowe Steve Rowe added a comment - AFAICT Mike forgot to include the JIRA in his log message for this commit (from the email sent to commits@l.a.o): Repository: lucene-solr Updated Branches: refs/heads/master 70440bbbd -> 126ac9a5f BytesRefHash.sort always sorts in unicode order Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/126ac9a5 Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/126ac9a5 Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/126ac9a5 Branch: refs/heads/master Commit: 126ac9a5fe00fbbc6870ef25ae3fc6af6cd7c557 Parents: 70440bb Author: Mike McCandless <mikemccand@apache.org> Authored: Sat Feb 27 10:26:20 2016 -0500 Committer: Mike McCandless <mikemccand@apache.org> Committed: Sat Feb 27 10:26:20 2016 -0500
          Hide
          mikemccand Michael McCandless added a comment -

          Sorry, yeah I forgot the issue number! Thanks for linking it Steve Rowe!

          Show
          mikemccand Michael McCandless added a comment - Sorry, yeah I forgot the issue number! Thanks for linking it Steve Rowe !

            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              mikemccand Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development