Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      lowercase suppl. characters correctly.

      this only fixes the filter, the LowerCaseTokenizer is part of a more complex issue (CharTokenizer)

      1. LUCENE-2069.patch
        28 kB
        Simon Willnauer
      2. LUCENE-2069.patch
        27 kB
        Simon Willnauer
      3. LUCENE-2069.patch
        27 kB
        Simon Willnauer
      4. LUCENE-2069.patch
        22 kB
        Robert Muir
      5. LUCENE-2069.patch
        22 kB
        Robert Muir
      6. LUCENE-2069.patch
        2 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          Simon, if you have a moment maybe you can review this one for me?

          Show
          Robert Muir added a comment - Simon, if you have a moment maybe you can review this one for me?
          Hide
          Simon Willnauer added a comment -

          Robert, I assume you did use those weird chars in the test on purpose - I wonder if there are some "real" codepoints that we could use in the test?

          The code looks good to me, this is the way to go for char lowercaseing with Unicode 4.0

          Show
          Simon Willnauer added a comment - Robert, I assume you did use those weird chars in the test on purpose - I wonder if there are some "real" codepoints that we could use in the test? The code looks good to me, this is the way to go for char lowercaseing with Unicode 4.0
          Hide
          Robert Muir added a comment -

          Simon, those "wierd" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!

          Show
          Robert Muir added a comment - Simon, those "wierd" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!
          Hide
          Simon Willnauer added a comment -

          we might need a changes.txt entry here too?!

          Show
          Simon Willnauer added a comment - we might need a changes.txt entry here too?!
          Hide
          Robert Muir added a comment -

          Simon, yes see LUCENE-1689.
          this is my question of the day, how are we handling this which is really a backwards break in a way, but honestly a bugfix because we should have supported Unicode 4.0 in Lucene 3.0, since thats the unicode version of java 5.

          Show
          Robert Muir added a comment - Simon, yes see LUCENE-1689 . this is my question of the day, how are we handling this which is really a backwards break in a way, but honestly a bugfix because we should have supported Unicode 4.0 in Lucene 3.0, since thats the unicode version of java 5.
          Hide
          Uwe Schindler added a comment -

          we can change it whenever we want, we must only supply a matchVersion switch....

          Show
          Uwe Schindler added a comment - we can change it whenever we want, we must only supply a matchVersion switch....
          Hide
          Robert Muir added a comment -

          Uwe, we can use matchVersion for all of this, this is true, and I will help.

          but see my comment on LUCENE-1689 (since i feel it affects all the issues), it will result in a lot of code complexity. Just a warning.

          Show
          Robert Muir added a comment - Uwe, we can use matchVersion for all of this, this is true, and I will help. but see my comment on LUCENE-1689 (since i feel it affects all the issues), it will result in a lot of code complexity. Just a warning.
          Hide
          Robert Muir added a comment -

          here is a patch that supports the old broken behavior also via Version.

          Show
          Robert Muir added a comment - here is a patch that supports the old broken behavior also via Version.
          Hide
          Robert Muir added a comment -

          forgot javadocs describing what the version does, sorry.

          Show
          Robert Muir added a comment - forgot javadocs describing what the version does, sorry.
          Hide
          Robert Muir added a comment -

          if you want my vote, it is that we treat issues like this as bugs and not do all this Version stuff.

          i supplied this patch (22KB versus 2KB) to show how even the smallest issue creates more complexity.
          Also, read the javadocs for what Version does, it reads just like a bug:

          • As of 3.1, supplementary characters are properly lowercased.

          I mean, honestly, its not like we provided a back compat mechanism for 3.0,
          where this behavior changed for lots of contrib that uses String-based methods, such as String toLowerCase (they return different results on JRE5 than JRE4)

          but we can go either way, doesn't matter to me.

          Show
          Robert Muir added a comment - if you want my vote, it is that we treat issues like this as bugs and not do all this Version stuff. i supplied this patch (22KB versus 2KB) to show how even the smallest issue creates more complexity. Also, read the javadocs for what Version does, it reads just like a bug: As of 3.1, supplementary characters are properly lowercased. I mean, honestly, its not like we provided a back compat mechanism for 3.0, where this behavior changed for lots of contrib that uses String-based methods, such as String toLowerCase (they return different results on JRE5 than JRE4) but we can go either way, doesn't matter to me.
          Hide
          Mark Miller added a comment -

          But we try and maintain index back compatibility with bugs too? We don't want terms to be lost in an index.

          But it depends as always - if something has long been a problem and broken, then perhaps it doesn't make sense to bend over backwards about it now. We just have to look at everything, put the priority on making life best for users while balancing somewhat with dev/maintenance headaches and come to a consensus - easy !

          Show
          Mark Miller added a comment - But we try and maintain index back compatibility with bugs too? We don't want terms to be lost in an index. But it depends as always - if something has long been a problem and broken, then perhaps it doesn't make sense to bend over backwards about it now. We just have to look at everything, put the priority on making life best for users while balancing somewhat with dev/maintenance headaches and come to a consensus - easy !
          Hide
          Robert Muir added a comment -

          Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues!

          doesn't matter to me, I just present both alternatives! all i want is for us to make a decision.

          Show
          Robert Muir added a comment - Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues! doesn't matter to me, I just present both alternatives! all i want is for us to make a decision.
          Hide
          Simon Willnauer added a comment -

          Simon, those "wierd" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!

          thats what I guessed otherwise it would not work though . I was just wondering if there are some more expressive once out there.

          Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues!

          +1

          Show
          Simon Willnauer added a comment - Simon, those "wierd" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0! thats what I guessed otherwise it would not work though . I was just wondering if there are some more expressive once out there. Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues! +1
          Hide
          Robert Muir added a comment -

          But we try and maintain index back compatibility with bugs too?

          Mark, you are right. The Version description says this: Match settings and bugs in Lucene's 3.0 release.
          I guess we should at least try, I think we can do it.

          Show
          Robert Muir added a comment - But we try and maintain index back compatibility with bugs too? Mark, you are right. The Version description says this: Match settings and bugs in Lucene's 3.0 release. I guess we should at least try, I think we can do it.
          Hide
          Simon Willnauer added a comment -

          I revised the patch and fixed some issues:

          • replaced real characters in tests
          • extended tests to boundaries
          • Removed "code duplication" in LowercaseFilter

          the latter is the most important issue. I figured that if we implement a factory with the basic codePointAt method based on a version we can implement the most of the algorithms / methods just by obtaining the version correspondent instance of CharacterUtils (new class I introduced) What this class does is pretty simple - if version >= 3.1 it delegates to the Character correspondent while for earlier versions it convert a character to a codepoint without checking the for high surrogates. Once we have done this conversion we can simply use all the Character.*(int) methods as they are.

          Show
          Simon Willnauer added a comment - I revised the patch and fixed some issues: replaced real characters in tests extended tests to boundaries Removed "code duplication" in LowercaseFilter the latter is the most important issue. I figured that if we implement a factory with the basic codePointAt method based on a version we can implement the most of the algorithms / methods just by obtaining the version correspondent instance of CharacterUtils (new class I introduced) What this class does is pretty simple - if version >= 3.1 it delegates to the Character correspondent while for earlier versions it convert a character to a codepoint without checking the for high surrogates. Once we have done this conversion we can simply use all the Character.*(int) methods as they are.
          Hide
          Simon Willnauer added a comment -

          btw. this also works for CharArraySet - that way we can easily implement it with Version without duplicating any code. Readable, clean and compatible.

          I will update the CharArraySet patch once I got comments on this.

          Show
          Simon Willnauer added a comment - btw. this also works for CharArraySet - that way we can easily implement it with Version without duplicating any code. Readable, clean and compatible. I will update the CharArraySet patch once I got comments on this.
          Hide
          Robert Muir added a comment -

          Hi Simon, this is a cool idea!

          I need to think this through, can you think of other places (non-lowercasing) where we could use this?
          Even if we can only use it there, I think it might still be a good idea to keep things simple.

          I do think we should mark the class deprecated and only used for lucene back compat purposes if we decide to use it.

          Show
          Robert Muir added a comment - Hi Simon, this is a cool idea! I need to think this through, can you think of other places (non-lowercasing) where we could use this? Even if we can only use it there, I think it might still be a good idea to keep things simple. I do think we should mark the class deprecated and only used for lucene back compat purposes if we decide to use it.
          Hide
          Robert Muir added a comment -

          Simon, i took a quick look at contrib analyzers, for example.
          This utility class could make back compat easier for a lot of the code, i.e. unicode block calculations in the CJK code, greek diacritic/lowercase folding in the greek code, ...
          I think we should go this route.

          Show
          Robert Muir added a comment - Simon, i took a quick look at contrib analyzers, for example. This utility class could make back compat easier for a lot of the code, i.e. unicode block calculations in the CJK code, greek diacritic/lowercase folding in the greek code, ... I think we should go this route.
          Hide
          Simon Willnauer added a comment -

          I also found some others like
          BrazilianStemmer
          ChineseTokenizer
          FrenchStemmer
          DutchStemmer

          and many more.... +1 for this from my side.
          As this seems to be fundamental we should try to get it in sooner or later so we can get the rest going.

          simon

          Show
          Simon Willnauer added a comment - I also found some others like BrazilianStemmer ChineseTokenizer FrenchStemmer DutchStemmer and many more.... +1 for this from my side. As this seems to be fundamental we should try to get it in sooner or later so we can get the rest going. simon
          Hide
          Simon Willnauer added a comment -

          Added CHANGES.TXT entry for this new feature.
          We both agreed that we can deprecated CharacterUtils later once we are close to getting rid of it.

          Show
          Simon Willnauer added a comment - Added CHANGES.TXT entry for this new feature. We both agreed that we can deprecated CharacterUtils later once we are close to getting rid of it.
          Hide
          Robert Muir added a comment -

          Thanks for your work here Simon. I will commit soon if no one objects.

          Show
          Robert Muir added a comment - Thanks for your work here Simon. I will commit soon if no one objects.
          Hide
          Robert Muir added a comment -

          damn we have to use the limit form of codePointAt, just to be sure.

          if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...

          Show
          Robert Muir added a comment - damn we have to use the limit form of codePointAt, just to be sure. if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...
          Hide
          Simon Willnauer added a comment - - edited

          damn we have to use the limit form of codePointAt, just to be sure.

          no we don't - at least not in this particular case

          if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...

          if this rare situation occurs the term length will still prevent the changed trail surrogate from being part of the token. This includes a super tiny overhead but I guess we can simply ignore this. The lead surrogate will not be changed at all in this case - if there is a situation where this could happen I'm not aware of it!

          Show
          Simon Willnauer added a comment - - edited damn we have to use the limit form of codePointAt, just to be sure. no we don't - at least not in this particular case if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token... if this rare situation occurs the term length will still prevent the changed trail surrogate from being part of the token. This includes a super tiny overhead but I guess we can simply ignore this. The lead surrogate will not be changed at all in this case - if there is a situation where this could happen I'm not aware of it!
          Hide
          Simon Willnauer added a comment -

          If we are too desperate about it I would suggest to have something like the following just above the loop:

           if(buffer.length >= length)
                  buffer[length] = 0x00;
          
          Show
          Simon Willnauer added a comment - If we are too desperate about it I would suggest to have something like the following just above the loop: if (buffer.length >= length) buffer[length] = 0x00;
          Hide
          Simon Willnauer added a comment -

          I updated the patch with another testcase for a trailing surrogate leftover in the termbuffer. I also added a missing @Override and fixed some wording in the javadoc

          Show
          Simon Willnauer added a comment - I updated the patch with another testcase for a trailing surrogate leftover in the termbuffer. I also added a missing @Override and fixed some wording in the javadoc
          Hide
          Uwe Schindler added a comment -

          Looks good, +1 to commit!

          Show
          Uwe Schindler added a comment - Looks good, +1 to commit!
          Hide
          Robert Muir added a comment -

          Simon you are right, there is no problem.

          maybe for other things in the future we will need codePointAt() with the limit param, we could just add it to CharacterUtils if/when we need it.

          Show
          Robert Muir added a comment - Simon you are right, there is no problem. maybe for other things in the future we will need codePointAt() with the limit param, we could just add it to CharacterUtils if/when we need it.
          Hide
          Robert Muir added a comment -

          Committed revision 885024.

          Show
          Robert Muir added a comment - Committed revision 885024.

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development