Lucene - Core
  1. Lucene - Core
  2. LUCENE-2068

fix reverseStringFilter for unicode 4.0

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      ReverseStringFilter is not aware of supplementary characters: when it reverses it will create unpaired surrogates, which will be replaced by U+FFFD by the indexer (but not at query time).
      The wrong words will conflate to each other, and the right words won't match, basically the whole thing falls apart.

      This patch implements in-place reverse with the algorithm from apache harmony AbstractStringBuilder.reverse0()

      1. LUCENE_2068.patch
        13 kB
        Simon Willnauer
      2. LUCENE_2068.patch
        6 kB
        Simon Willnauer
      3. LUCENE_2068.patch
        6 kB
        Simon Willnauer
      4. LUCENE-2068.patch
        13 kB
        Robert Muir
      5. LUCENE-2068.patch
        5 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          Commited in revision 883149

          Show
          Simon Willnauer added a comment - Commited in revision 883149
          Hide
          Simon Willnauer added a comment -

          added a CHANGES.txt entry. Will commit soon.

          Show
          Simon Willnauer added a comment - added a CHANGES.txt entry. Will commit soon.
          Hide
          Simon Willnauer added a comment -

          Robert, we can take the MatchV. Version of the patch while I still think this one should not maintain backcompatibility here as it maintains extremely broken code.

          I plan to commit this soon if nobody objects.

          simon

          Show
          Simon Willnauer added a comment - Robert, we can take the MatchV. Version of the patch while I still think this one should not maintain backcompatibility here as it maintains extremely broken code. I plan to commit this soon if nobody objects. simon
          Hide
          Simon Willnauer added a comment -

          bq: edit: sorry simon, mime-type/charset issues on my side, x-diff versus x-patch thing

          happens

          If it's really that broken for supplemental chars, we don't have to support back compat, right?

          +1 This code is completely broken without the patch, I would agree we do not need it in this case.

          simon

          Show
          Simon Willnauer added a comment - bq: edit: sorry simon, mime-type/charset issues on my side, x-diff versus x-patch thing happens If it's really that broken for supplemental chars, we don't have to support back compat, right? +1 This code is completely broken without the patch, I would agree we do not need it in this case. simon
          Hide
          Yonik Seeley added a comment -

          If it's really that broken for supplemental chars, we don't have to support back compat, right?

          Show
          Yonik Seeley added a comment - If it's really that broken for supplemental chars, we don't have to support back compat, right?
          Hide
          Robert Muir added a comment - - edited

          This patch adds back compat for the buggy behavior with version.
          It is gross because there were many public static methods exposed, but for example Solr is using these.

          btw:
          Simon, are you applying patches with Eclipse?
          If so it will not work, you need to open the patch in an editor, select all, copy, and then apply from Clipboard.
          In your patch, the test is corrupted, the characters should be chinese... I think this is why you were confused about tests before.

          edit: sorry simon, mime-type/charset issues on my side, x-diff versus x-patch thing

          Show
          Robert Muir added a comment - - edited This patch adds back compat for the buggy behavior with version. It is gross because there were many public static methods exposed, but for example Solr is using these. btw: Simon, are you applying patches with Eclipse? If so it will not work, you need to open the patch in an editor, select all, copy, and then apply from Clipboard. In your patch, the test is corrupted, the characters should be chinese... I think this is why you were confused about tests before. edit: sorry simon, mime-type/charset issues on my side, x-diff versus x-patch thing
          Hide
          Robert Muir added a comment -

          Due to the way this currently works, taking completely valid unicode data and making invalid unicode data, I am changing this to bug, as Mark hinted at.

          Show
          Robert Muir added a comment - Due to the way this currently works, taking completely valid unicode data and making invalid unicode data, I am changing this to bug, as Mark hinted at.
          Hide
          Simon Willnauer added a comment -

          We will get this in once 3.0 is out. I wonder if we should add something to the changes.txt file?

          Would make sense to me to add it to: Changes in runtime behavior

          thoughts

          Show
          Simon Willnauer added a comment - We will get this in once 3.0 is out. I wonder if we should add something to the changes.txt file? Would make sense to me to add it to: Changes in runtime behavior thoughts
          Hide
          Simon Willnauer added a comment -

          removed static import

          Show
          Simon Willnauer added a comment - removed static import
          Hide
          Simon Willnauer added a comment -

          I just think we should use a consistent scheme (does not matter to me which way)

          I agree, we should rather not use the static import - its more expressive if it is explicit!
          I will upload another patch.

          Show
          Simon Willnauer added a comment - I just think we should use a consistent scheme (does not matter to me which way) I agree, we should rather not use the static import - its more expressive if it is explicit! I will upload another patch.
          Hide
          Robert Muir added a comment -

          I guess this is good to go once 3.1 is out.

          Simon, I agree. One last question, should we do the import static of Character.*? In other patches I am using Character.xxyyZZ() instead.

          I just think we should use a consistent scheme (does not matter to me which way)

          Show
          Robert Muir added a comment - I guess this is good to go once 3.1 is out. Simon, I agree. One last question, should we do the import static of Character.*? In other patches I am using Character.xxyyZZ() instead. I just think we should use a consistent scheme (does not matter to me which way)
          Hide
          Robert Muir added a comment -

          Is this an improvement or a bug? The summary sounds kind of buggish ...

          I'll try to restrain myself, but I think we should have fixed unicode 4 support in lucene 3.0, because then it matches the unicode version of java 1.5
          the problem is we could not do any of this in 2.9, because you need java 1.5 to actually implement most of the support, so it was a chicken and egg problem.

          imho its all bugs, but i'll list these issues as improvements

          Show
          Robert Muir added a comment - Is this an improvement or a bug? The summary sounds kind of buggish ... I'll try to restrain myself, but I think we should have fixed unicode 4 support in lucene 3.0, because then it matches the unicode version of java 1.5 the problem is we could not do any of this in 2.9, because you need java 1.5 to actually implement most of the support, so it was a chicken and egg problem. imho its all bugs, but i'll list these issues as improvements
          Hide
          Robert Muir added a comment -

          Simon, thanks! l agree we should use these Character methods, because I believe the JDK will implement them in the fastest way.
          (bitmask for some things, etc)

          The reason it looked like this is because I copied & pasted from apache harmony then edited to our needs (allow char[], int, int).

          Show
          Robert Muir added a comment - Simon, thanks! l agree we should use these Character methods, because I believe the JDK will implement them in the fastest way. (bitmask for some things, etc) The reason it looked like this is because I copied & pasted from apache harmony then edited to our needs (allow char[], int, int).
          Hide
          Mark Miller added a comment -

          Is this an improvement or a bug? The summary sounds kind of buggish ...

          Show
          Mark Miller added a comment - Is this an improvement or a bug? The summary sounds kind of buggish ...
          Hide
          Simon Willnauer added a comment -

          Robert, I cleaned up the patch a little and did some minor optimizations.
          I replace the ugly surrogatePair checks with the JDKs version. This method call will be inlined anyway!

          I added some JavaDoc comments as well. I guess this is good to go once 3.1 is out.

          Show
          Simon Willnauer added a comment - Robert, I cleaned up the patch a little and did some minor optimizations. I replace the ugly surrogatePair checks with the JDKs version. This method call will be inlined anyway! I added some JavaDoc comments as well. I guess this is good to go once 3.1 is out.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development