Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      This issue is about supplementary character handling for Java 5. Java 5 is based on Unicode 4.0, which means variable-width encoding: supplementary characters (code points above U+FFFF) are represented in UTF-16 as surrogate pairs of two char values.

      Supplementary character support should be fixed in code that works with char/char[].

      For example:
      StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't silently remove supplementary characters, or modified to look for surrogates and behave correctly.
      LowerCaseFilter should be modified to lowercase supplementary characters correctly.
      CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.

      In all of these cases the code should remain optimized for the BMP case; supplementary characters should be the exception, but they must still work.
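
      As a rough illustration (a hypothetical snippet, not from any attached patch), surrogate-aware code iterates by code point rather than by char, using only java.lang.Character methods available since Java 5:

      // Hypothetical sketch: scanning a char[] by code point so a supplementary
      // character is never split into two bogus chars or silently dropped.
      static boolean containsOnlyLetters(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
          int cp = Character.codePointAt(buffer, i); // joins a surrogate pair into one int
          if (!Character.isLetter(cp))               // the int overload covers all planes
            return false;
          i += Character.charCount(cp);              // advance 1 for BMP, 2 for supplementary
        }
        return true;
      }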

      Attachments

      1. testCurrentBehavior.txt
        8 kB
        Robert Muir
      2. LUCENE-1689.patch
        7 kB
        Robert Muir
      3. LUCENE-1689.patch
        19 kB
        Robert Muir
      4. LUCENE-1689.patch
        52 kB
        Robert Muir
      5. LUCENE-1689_lowercase_example.txt
        1.0 kB
        Robert Muir

        Issue Links

          Activity

          Robert Muir added a comment -

          An example of how LowerCaseFilter might be fixed.
          I only changed the new-API method for demonstration purposes.
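
          A minimal sketch of what such a codepoint-aware loop could look like (an assumption; the attached LUCENE-1689_lowercase_example.txt is not reproduced here, and termAtt stands for the filter's TermAttribute):

          // Hypothetical sketch of a codepoint-aware LowerCaseFilter loop. It assumes
          // lowercasing never changes the UTF-16 length of the term (see the
          // discussion of that assumption below).
          char[] buffer = termAtt.termBuffer();
          int length = termAtt.termLength();
          for (int i = 0; i < length; ) {
            int cp = Character.codePointAt(buffer, i);                    // pair -> one code point
            i += Character.toChars(Character.toLowerCase(cp), buffer, i); // write back, advance 1 or 2
          }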

          Robert Muir added a comment -

          My example has some "issues" (i.e. it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints).
          It also doesn't correctly handle invalid Unicode data (unpaired surrogates).

          Such complexity can all be added to the slower-path supplementary case; the intent is just to show it wouldn't be a terribly invasive change.
          Thanks!

          Robert Muir added a comment -

          ICU uses the following idiom to check whether a char ch is a surrogate; it might be the fastest way to ensure the performance impact is minimal:

          (ch & 0xFFFFF800) == 0xD800
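
          For example, wrapped in a helper (a minimal sketch, not ICU's actual code):

          // One mask-and-compare, no range checks: clearing the low 11 bits leaves
          // 0xD800 exactly when ch falls in the surrogate range U+D800..U+DFFF.
          static boolean isSurrogate(char ch) {
            return (ch & 0xFFFFF800) == 0xD800;
          }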

          Michael McCandless added a comment -

          Robert, could you flesh this patch out to a committable point? Ie, handle unpaired surrogates, and add a test case that first shows that LowerCaseFilter incorrectly breaks up surrogates? Thanks!

          it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints

          Is it invalid to make this assumption? Ie, does the Unicode standard not guarantee it?

          Robert Muir added a comment -

          Michael: LowerCaseFilter doesn't incorrectly break surrogates; it just won't lowercase supplementary codepoints.

          But I can get it in shape, sure.

          Simon Willnauer added a comment -

          The scary thing is that this happens already if you run Lucene on a 1.5 VM, even without introducing 1.5 code.
          I think we need to act on this issue ASAP and release it together with 3.0 -> full support for Unicode 4.0 in Lucene 3.0.

          I also thought about the implementation a little bit. The need to detect chars > BMP and operate on those might be spread out across Lucene (quite a couple of analyzers and filters, etc.). Performance could truly suffer from this if it is done "wrong" or even more than once.

          It might be worth considering making the detection pluggable, with an initial filter that only checks whether surrogates are present in a token and sets an indicator on the token representation, so that subsequent TokenStreams can operate on it without rechecking. This would also preserve performance for those who do not need chars > BMP (which could be quite a large number of people). Those could then simply not supply such an initial filter.

          Just a couple of random thoughts.

          Robert Muir added a comment -

          Simon, I want to address your comment on performance.
          I think that surrogate detection is cheap when done right, and I don't think there are a ton of places that need changes.
          But I don't think any indicator is really appropriate; for example, my TokenFilter might want to convert one Chinese character in the BMP to another one outside of the BMP. It is all Unicode.

          But there is more than just analysis involved here; for example, I have not tested WildcardQuery's ? operator.
          I'm not trying to go berzerko and be 'ultra-correct', but basic things like that should work.
          For situations where it's not worth it, i.e. FuzzyQuery's scoring, we should just document that the calculation is based on 'code units', and leave it alone.

          Robert Muir added a comment -

          I am curious how you plan on approaching backwards compat?

          #1 The patch is needed for correct Java 1.5 behavior; it's a 1.5 migration issue.
          #2 The patch won't work on Java 1.4 because the JDK does not supply the needed functionality.
          #3 It's my understanding you like to deprecate things (say, CharTokenizer) and give people a transition time. How will this work?

          I intend to supply a patch that fixes the problems, but I wanted to bring up those issues...

          Robert Muir added a comment -

          Simon, a response to your other comment: 'I think we need to act on this issue ASAP and release it together with 3.0 -> full support for Unicode 4.0 in Lucene 3.0.'

          Full Unicode support would be great!

          But this involves a whole lot more; I'm trying to create a contrib under LUCENE-1488 for "full Unicode support".
          A lot of what is necessary isn't available in the Java 1.5 JDK, such as Unicode normalization.

          Maybe we can settle for 'full support for UCS (Unicode Character Set)' as a start!

          Robert Muir added a comment -

          I forgot to answer your question, Michael:

          it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints
          Is it invalid to make this assumption? Ie, does the Unicode standard not guarantee it?

          I do not think it guarantees this for all future Unicode versions. In my opinion, we should exploit things like this if I can show a test case that proves it's true for all codepoints in the current version of Unicode.
          And it should be documented that this could possibly change in some future version.
          In this example, it's a nice simplification because it guarantees the length (in code units) will not change!

          I think for a next step on this issue I will create and upload a test case showing the issues and detailing some possible solutions.
          For some of them, maybe a javadoc update is the most appropriate, but for others, maybe an API change is the right way to go.
          Then we can figure out what should be done.

          Yonik Seeley added a comment -

          I am curious how you plan on approaching backwards compat?

          I don't see a real back compat issue... I can't imagine anyone relying on the fact that >BMP chars wouldn't be lowercased. To rely on that would also be relying on undocumented behavior.

          It seems like this (handling code points beyond the BMP) is really a collection of issues that could be committed independently when ready?

          Moving to 3.1 since it requires Java 1.5 for much of it.
          If there's something desirable to slip in for 2.9 that doesn't depend on Java5, we can still do that.

          Robert Muir added a comment -

          Yonik, I think the problem is where method signatures must change, such as CharTokenizer, which is required to fix LetterTokenizer and friends.

          These are probably the worst offenders, as a lot of these tokenizers just remove >BMP chars, which is really nasty behavior for CJK.

          I agree it's a collection of issues, but maybe there can be an overall strategy?

          Yonik Seeley added a comment -

          Yonik, I think the problem is where method signatures must change, such as CharTokenizer, which is required to fix LetterTokenizer and friends.

          Ah, OK. Actually it just occurred to me that this would also require reindexing, otherwise queries that hit documents in the past would mysteriously start missing them (for text outside the BMP).

          Robert Muir added a comment -

          This is just a patch with test cases showing the existing behavior.

          Perhaps these should be fixed:
          Simple/Standard/StopAnalyzer/etc.: delete all supplementary characters completely.
          LowerCaseFilter: doesn't lowercase supplementary characters correctly.
          WildcardQuery: the ? operator does not work correctly.

          Perhaps these just need some javadocs:
          FuzzyQuery: scoring is strange because it's based upon surrogates; leave alone and javadoc it.
          LengthFilter: length is calculated in UTF-16 code units; leave alone and javadoc it.

          ... and there's always the option to not change any code, but just javadoc all the behavior as a "fix", providing stuff in contrib or elsewhere that works correctly.
          Let me know what you think.

          Simon Willnauer added a comment -

          Robert, are you still working on this issue? I have seen a couple of questions about supplementary character support on the mailing lists. I will see how much time I have during the next weeks to get this going.

          Simon

          Robert Muir added a comment - edited

          Simon, not yet. I was trying to solicit some opinions on what the best approach should be. I realize this is a very special case.

          Robert Muir added a comment -

          One tricky part of this is CharTokenizer; a few things subclass this guy.

          • Sometimes it is OK (WhitespaceTokenizer, because all whitespace in Unicode is <= 0xFFFF)
          • Sometimes it is not (LetterTokenizer)

          It depends how it's being used. So I think instead of just

          abstract boolean isTokenChar(char)
          char normalize(char)
          

          we can add additional int-based methods, where the default implementation is to defer to the char-based ones.

          This would preserve backwards compatibility.
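
          Roughly like this (a sketch of the idea, not the actual patch; the BMP-only defaults are one possible choice):

          // Sketch: new int-based hooks whose defaults defer to the existing
          // char-based ones, so old subclasses keep working unchanged.
          protected boolean isTokenChar(int c) {
            // the old char method can't see supplementary chars, so they are
            // rejected by default, matching the old behavior
            return c <= 0xFFFF && isTokenChar((char) c);
          }

          protected int normalize(int c) {
            // pass supplementary chars through untouched by default
            return c <= 0xFFFF ? normalize((char) c) : c;
          }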

          Michael McCandless added a comment -

          That sounds right.

          We would also deprecate the char-based methods? And note that in 3.0 we'll remove them and make the int-based methods abstract?

          And then fix CharTokenizer to properly handle the extended chars?

          We may need to use reflection to see if the current subclass "wants" chars or ints, and fall back to the old (current) char processing if it's a subclass that doesn't define the int-based methods.

          Robert Muir added a comment -

          Michael, yes, that's what I had in mind.

          Like you mentioned, the non-final CharTokenizer-based things (Whitespace, Letter) are a problem.

          The easiest way (it looks) is to put the reflection inside the non-final CharTokenizers: Whitespace and Letter.
          I want them to work correctly if you have not subclassed them, but if you have, it should call your char-based method (which won't work for supplementary characters, but it didn't before)!

          Thoughts?

          Robert Muir added a comment -

          Just a dump to show where I am trying to go.
          I haven't done anything to Whitespace/Letter yet, but you can see the changes to CharTokenizer.

          Robert Muir added a comment -

          By the way, I plan to replace any scary bit operations with UnicodeUtil functions. The JDK doesn't have methods to yank the lead or trail surrogate out of an int, except in sun.text.normalizer of course.
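
          For reference, the underlying arithmetic is standard UTF-16 math (hypothetical helper names, not the actual UnicodeUtil API):

          // Split a supplementary code point into its surrogate pair.
          // E.g. 0x10400 -> lead 0xD801, trail 0xDC00.
          static char leadSurrogate(int cp)  { return (char) (0xD7C0 + (cp >> 10)); }
          static char trailSurrogate(int cp) { return (char) (0xDC00 + (cp & 0x3FF)); }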

          Robert Muir added a comment -

          I think instead of the way I prototyped in the first patch, it might be better to keep the original CharTokenizer incrementToken definition available in the code.
          This is some temporary code duplication, but it would perform better for the backwards-compat case, and the backwards compatibility would be clearer, to me at least.

          Michael McCandless added a comment -

          I think instead of the way I prototyped in the first patch, it might be better to keep the original CharTokenizer incrementToken definition available in the code.
          This is some temporary code duplication, but it would perform better for the backwards-compat case, and the backwards compatibility would be clearer, to me at least.

          I suppose we could simply make an entirely new class which properly handles surrogates, and deprecate CharTokenizer in favor of it? Likewise we'd have to make new classes for the current subclasses of CharTokenizer (WhitespaceTokenizer, LetterTokenizer). That would simplify being back compatible.

          Robert Muir added a comment -

          Michael, I do think that would be the simplest, but also the most invasive (we would have to make new duplicate classes even for Arabic & Russian!).
          I think I will try to create a patch for my second method, and if it cannot be made simple, we can always do the new class/deprecation route?

          Robert Muir added a comment -

          Patch with a different technique for CharTokenizer and friends. I like this much more.

          For non-final subclasses that are moved to the new int API, I had to expose a protected boolean switch so that any old subclasses of them work as expected: see Arabic/Russian/Letter/Whitespace for examples.

          Robert Muir added a comment -

          Patch with some more work in contrib for this issue.

          Uwe Schindler added a comment -

          In your latest patch you have this getDeclaringClass() method using the while loop. To get the class declaring a method, it is simpler to call

          this.getClass().getMethod(...).getDeclaringClass()

          See TokenStream.java, for example, where the same is done for the new TokenStream API.

          Robert Muir added a comment -

          Uwe, unfortunately Class.getMethod only works for public methods!

          So I must use Class.getDeclaredMethod in this case, because the methods in question are protected. The problem with Class.getDeclaredMethod is that it does not check superclasses like Class.getMethod does, thus the loop.

          Uwe Schindler added a comment -

          OK, I did not know this. I read the Class.getMethod() javadocs again, and they state that it only returns public methods. In the case of TokenStream backwards compatibility, only public methods are affected, so I used the "simple and faster" way.

          I was already wondering why you changed this call, because the rest of the reflection code seemed to be copied from TokenStream, e.g. the names of the parameter type arrays...

          Robert Muir added a comment -

          Uwe, and I thank you for the good example! I tried to apply your same logic but was frustrated when it wasn't working, only to discover this annoyance with Java reflection.

          Uwe Schindler added a comment - edited

          For me it is strange that getMethod() does not work for protected methods, because they can be inherited. That it does not work for private methods is clear, because they are only known to the class itself (you cannot even override a private method). But the JDK defines it that way, and we must live with it.

          I would stop the iteration when getSuperclass() returns null and then return null (which cannot happen, as the method is defined at least in CharTokenizer, as you wrote in the comment).

          private Class getDeclaringClass(String name, Class[] params) {
            // getDeclaredMethod sees protected methods but, unlike getMethod,
            // does not search superclasses, so walk up the hierarchy manually
            for (Class clazz = this.getClass(); clazz != null; clazz = clazz.getSuperclass()) {
              try {
                clazz.getDeclaredMethod(name, params);
                return clazz;
              } catch (NoSuchMethodException e) {
                // not declared here; keep looking in the superclass
              }
            }
            return null; /* impossible: declared at least in CharTokenizer */
          }
          
          Robert Muir added a comment -

          Uwe, thanks. I will modify the loop in that way; it is much easier to read.

          Robert Muir added a comment -

          This is just a reminder that, if we go this route, I think we will need caching similar to what Uwe added for the TokenStream back compat, so that construction performance is not bad.
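
          Something along these lines, perhaps (a sketch modeled loosely on the TokenStream back-compat cache, not copied from it):

          // Sketch: cache the per-class answer to "does this subclass override the
          // new int-based isTokenChar?" so reflection runs once per class, not per
          // instance. Assumes java.util.{Collections,Map,WeakHashMap} imports.
          private static final Map overridesIntApi =
              Collections.synchronizedMap(new WeakHashMap());

          private boolean useIntApi() {
            Boolean cached = (Boolean) overridesIntApi.get(getClass());
            if (cached == null) {
              cached = Boolean.valueOf(getDeclaringClass("isTokenChar",
                  new Class[] { int.class }) != CharTokenizer.class);
              overridesIntApi.put(getClass(), cached);
            }
            return cached.booleanValue();
          }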

          Robert Muir added a comment -

          A couple of people have asked me about this issue lately. I would prefer to spin off smaller issues rather than create large patches that become out of date.

          Also, I think Simon is interested in working on some of this, so there will be more JIRA spam, but I think it will be easier to make progress.

          Robert Muir added a comment -

          Yonik, or anyone else, please let me know your thoughts on the following:

          I don't see a real back compat issue... I can't imagine anyone relying on the fact that >BMP chars wouldn't be lowercased. To rely on that would also be relying on undocumented behavior.

          Ah, OK. Actually it just occurred to me that this would also require reindexing, otherwise queries that hit documents in the past would mysteriously start missing them (for text outside the BMP).

          What should be our approach here wrt index back compat?
          For the issues mentioned here, I can't possibly see >BMP working currently for anyone, but you are right that it will change results.

          I don't want to break index back compat; I just wanted to mention that introducing Unicode 4 support, still with API back compat and with no performance degradation, is going to be somewhat challenging already.
          If we want to somehow support the "broken" analysis components for index back compat, then we also have to have a broken implementation available on top of the correct impl (probably using Version to handle this).
          In my opinion this would introduce a lot of complexity, but I will help do it, if we must.
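
          For instance (a sketch of the Version-gated approach; the constant and field names are illustrative, not a committed design):

          private final boolean useCodePoints; // hypothetical flag

          public LowerCaseFilter(Version matchVersion, TokenStream in) {
            super(in); // LowerCaseFilter extends TokenFilter
            // fixed, codepoint-aware behavior only for new indexes
            useCodePoints = matchVersion.onOrAfter(Version.LUCENE_31);
          }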

          Robert Muir added a comment -

          Btw, it's worth mentioning that this whole concept of index back compat wrt Unicode is already completely broken in Lucene.

          If someone upgrades to Lucene 3.0 and uses Java 1.5, Character.* operations already return different results than on Java 1.4, so they are already broken with respect to their previous index.

          Mark Miller added a comment -

          If there is nothing we can do here, then we just have to do the best we can, such as a prominent notice alerting users that if you transition JVMs between building and searching the index, and you are using or doing X, things will break.

          We should put this in a spot that is always pretty visible, perhaps even a new readme file titled something like IndexBackwardCompatibility, to which we can add other tips and gotchas as they come up. Or MaintainingIndicesAcrossVersions, or FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent/sticky entry at the top of CHANGES.txt.

          Robert Muir added a comment -

          If there is nothing we can do here, then we just have to do the best we can

          Oh, there is definitely a choice, and as I said, I am willing to help either way we go.
          I am just warning that there will be a lot more complexity if we want to use, say, Version to support both the broken implementation and the fixed implementation:
          lots of tests, and hopefully no bugs.

          Uwe already commented that he would like it controlled with Version; I'm just saying suppl. characters are notoriously tricky as it is!

          Mark Miller added a comment -

          I'm speaking in regards to:

          Btw, it's worth mentioning that this whole concept of index back compat wrt Unicode is already completely broken in Lucene.
          If someone upgrades to Lucene 3.0 and uses Java 1.5, Character.* operations already return different results than on Java 1.4, so they are already broken with respect to their previous index.

          We can fix that too? If so, I think we should. The TokenStream back compat was a bitch as well, but we had to slog through it.

          Robert Muir added a comment -

          We can fix that too? If so, I think we should. The TokenStream back compat was a bitch as well, but we had to slog through it.

          Mark, honestly, I do not yet know how this one could be reasonably done.
          The problem is that the behavior depends a lot upon the user's JVM; how in the hell do we know which JVM version was used to create some given index?

          Steve Rowe added a comment -

          I don't know if this is the right place to point this out, but: JFlex-generated scanners (e.g. StandardAnalyzer) do not properly handle supplementary characters.

          Unfortunately, it looks like the as-yet-unreleased JFlex 1.5 will not support supplementary characters either, so this will be a gap in Lucene's Unicode handling for a while.

          Mark Miller added a comment -

          Mark, honestly, I do not yet know how this one could be reasonably done.

          Then that's what I am saying we should be documenting.

          how in the hell do we know which JVM version was used to create some given index?

          If something like this made sense long term, we could start putting it in as index metadata; not helpful now, but it could end up being so.

          Robert Muir added a comment -

          Steven, no, it's definitely the right place to point it out! I know this is true, even with JFlex 1.5.

          One reason I wanted to split this issue up was to try to make 'improvements'; maybe we do not fix everything.
          There are also other options for StandardTokenizer/JFlex.
          For instance, we could not break between any surrogate pairs and classify them as CJK (index the individual character) for the time being; see the sketch after this comment.
          While technically incorrect, it would handle the common cases, i.e. ideographs from Big5-HKSCS, etc.

          But right now the topic is, I guess, Unicode and index back compat in general... trying to figure out what is the reasonable approach to handling this (supporting the old broken behavior and indexes created with it).
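
          A minimal illustration of that interim idea (hypothetical; a real fix would live in the JFlex grammar):

          // Emit each well-formed surrogate pair as its own single-code-point
          // token, the way CJK ideographs are indexed as unigrams.
          // Assumes java.util.{ArrayList,List} imports.
          static List supplementaryUnigrams(String text) {
            List tokens = new ArrayList();
            for (int i = 0; i < text.length() - 1; i++) {
              if (Character.isHighSurrogate(text.charAt(i))
                  && Character.isLowSurrogate(text.charAt(i + 1))) {
                tokens.add(text.substring(i, i + 2)); // one supplementary code point
                i++; // skip the trail surrogate
              }
            }
            return tokens;
          }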

          Robert Muir added a comment -

          Then that's what I am saying we should be documenting.

          +1. I think it should be in the release notes that if you upgrade your Java version to either 1.5 or 1.7, you should reindex, because Unicode versions change.
          Because of this, character properties change and some tokenizers, etc. will emit different results.

          This is really only somewhat related to this issue, but that's what my email thread on java-dev was all about.

          Steve Rowe added a comment -

          Robert, is there anything left to do here? I think this issue can be resolved as fixed.

          Steve Rowe added a comment -

          Resolving. Any remaining problems can be opened as separate issues.


            People

            • Assignee: Unassigned
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 2

              Dates

              • Created:
              • Updated:
              • Resolved:

                Development