
LUCENE-2183: Supplementary Character Handling in CharTokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      CharTokenizer is the abstract base class for all tokenizers operating at the character level. Yet those tokenizers still use char primitives instead of int codepoints, so they cannot handle supplementary characters (codepoints above U+FFFF, encoded as surrogate pairs in UTF-16). CharTokenizer should operate on codepoints while preserving backwards compatibility.
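
      A minimal, hypothetical illustration of the problem (not part of any attached patch; the string and classification calls are illustrative only):

        // 'a' + DESERET CAPITAL LETTER LONG I (U+10400, a surrogate pair in UTF-16) + 'b'
        String s = "a\uD801\uDC00b";

        // char-based iteration sees four code units; neither surrogate is a letter,
        // so a char-based LetterTokenizer silently drops the Deseret character
        for (int i = 0; i < s.length(); i++) {
          System.out.println(Character.isLetter(s.charAt(i)));      // true, false, false, true
        }

        // codepoint-based iteration sees three codepoints and classifies U+10400 correctly
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
          System.out.println(Character.isLetter(s.codePointAt(i))); // true, true, true
        }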

      Attachments

      1. LUCENE-2183.patch
        25 kB
        Simon Willnauer
      2. LUCENE-2183.patch
        58 kB
        Simon Willnauer
      3. LUCENE-2183.patch
        58 kB
        Simon Willnauer
      4. LUCENE-2183.patch
        58 kB
        Simon Willnauer
      5. LUCENE-2183.patch
        60 kB
        Simon Willnauer
      6. LUCENE-2183.patch
        62 kB
        Uwe Schindler
      7. LUCENE-2183.patch
        65 kB
        Simon Willnauer

        Issue Links

          Activity

          Simon Willnauer added a comment -

          linked issues

          Simon Willnauer added a comment -

          Here is a first proposal of a slightly new API that decouples the "isTokenChar" predicate and the "normalize" method, instead of forcing CharTokenizer subclasses to override methods.
          I introduced a new class, TokenCharFunctions, that is passed to the constructor of CharTokenizer and used internally. The patch preserves full backwards compatibility and provides a clean way to move towards a consistent codepoint-based API, which would still offer the flexibility to use the old, buggy behavior even once the char-based methods are removed.

          I consider this patch a basis for the discussion of how to solve this problem. Other approaches would be to use reflection, like the TokenStream backwards-compat layer does, but I personally prefer not to use reflection and would rather use delegation instead of inheritance.

          Looking forward to seeing some comments.

          Robert Muir added a comment -

          Hello, my first comment is that I really like how the IO handling is done in CharacterUtils.

          This solves a problem beyond CharTokenizer: other tokenizers in Lucene contrib that do NOT extend CharTokenizer have the same logic and also need to be fixed.

          So we could reuse this code in other places too, such as CJKTokenizer. I think we could also reuse this code to fix some unrelated problems in the n-gram tokenizers (at a glance, I do not see how the n-gram tokenizer IO handling works correctly at all). A sketch of the buffering invariant is below.
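
          To make the buffering invariant concrete, here is a rough sketch of the surrogate-aware buffer filling such IO handling needs (a hypothetical helper, not the actual CharacterUtils API):

          import java.io.IOException;
          import java.io.Reader;

          final class SurrogateAwareIO {
            // Invariant: never end a fill on a lone high surrogate when more input
            // may follow, so a surrogate pair is never split across two buffer fills.
            static int fill(Reader reader, char[] buffer) throws IOException {
              int read = reader.read(buffer, 0, buffer.length - 1); // reserve one slot
              if (read > 0 && Character.isHighSurrogate(buffer[read - 1])) {
                int one = reader.read(buffer, read, 1);             // complete the pair
                if (one == 1) read++;
              }
              return read; // -1 at end of stream, like Reader.read
            }
          }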

          Robert Muir added a comment - - edited

          Hello Simon, another option very similar to yours (I am not sure if it would really work, but just thinking out loud somewhat) could be:

          /** this method will be declared abstract in Lucene 4.0 */
          public boolean isTokenChar(int ch) {
            throw new UnsupportedOperationException();
          }

          /** @deprecated will be removed in Lucene 4.0 */
          public boolean isTokenChar(char ch) {
            return isTokenChar((int) ch);
          }
          

          and do the same for normalize(). The rest would be the same as your patch:

          • Use CharacterUtils for io-buffering
          • Use CharacterUtils for character/codepoint iteration.
          • Use Version to decide which method to call instead of reflection: this should not be a conditional on each call to isTokenChar(), but handled by e.g. two private inner classes (a sketch follows this comment).

          The difference would be that the API would appear more natural in my opinion, and once the deprecations are removed we would end up with an abstract class exposing the int-based equivalent of what we have now.

          If someone attempts to use a CharTokenizer that does not support the int-based methods (i.e. only implements the char-based methods) with Version.LUCENE_31, then this would throw UOE, which in my opinion is correct, as it does not support the behavior of that version.

          edit: I changed the deprecation of isTokenChar(char) to 4.0. The index back compat would still exist because using CharacterUtils with isTokenChar(int) is the same thing.
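
          A rough sketch of what the Version-driven dispatch could look like inside CharTokenizer (illustrative names, not the committed API): the ctor selects one of two private strategies once, so there is no per-character version check:

          // hypothetical fragment inside CharTokenizer
          private abstract class CharPredicate {
            abstract boolean isTokenChar(int cp);
          }

          private final CharPredicate predicate;

          protected CharTokenizer(Version matchVersion, Reader input) {
            super(input);
            if (matchVersion.onOrAfter(Version.LUCENE_31)) {
              predicate = new CharPredicate() {   // new codepoint-based path
                boolean isTokenChar(int cp) { return CharTokenizer.this.isTokenChar(cp); }
              };
            } else {
              predicate = new CharPredicate() {   // deprecated per-code-unit path
                boolean isTokenChar(int cp) { return CharTokenizer.this.isTokenChar((char) cp); }
              };
            }
          }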

          Uwe Schindler added a comment -

          +1 @ Robert's comment! I like this more, and it would work correctly as proposed with the UOE.

          Robert Muir added a comment - - edited

          I thought about this some more (the idea I mentioned earlier, not Simon's patch), but I am worried about one thing:

          Consider LetterTokenizer, which is a non-final subclass of CharTokenizer.
          Let's say you make a LetterAndNumberTokenizer which extends LetterTokenizer, but you do not implement the int-based method:

          public boolean isTokenChar(char c) {
            return super.isTokenChar(c) || Character.isDigit(c);
          }
          

          We have fixed LetterTokenizer so it has isTokenChar(int), but that means if someone tries to use this LetterAndNumberTokenizer with Version.LUCENE_31, it will not work: it will not throw UOE, and it will silently discard numbers, since it will call the LetterTokenizer int-based method.

          Of course it will work correctly with Version.LUCENE_30, so it is not a back-compat problem, but it will not throw UOE and will silently behave incorrectly for LUCENE_31 until the 'int' method is implemented.

          So I think this is a problem in this design, and I do not know how to fix it without reflection.

          Simon Willnauer added a comment -

          Hey guys, thanks for your comments.
          When I started thinking about this issue I had a quick chat with Robert, and we figured that his solution could work, so I implemented it.
          Yet, I found two problems with it:
          1. If a user calls super.isTokenChar(char) and the superclass has implemented the int method, the UOE will never be thrown and the code does not behave as "expected" from the user's perspective. This is what Robert explained above. We could solve this problem with reflection, which leads to the second problem.

          2. If a tokenizer like LowerCaseTokenizer only overrides normalize(char|int), it relies on the superclass implementation of isTokenChar. Yet if we solve problem 1, the user would be forced to override isTokenChar just to call super.isTokenChar; otherwise the reflection code would either raise an exception that the int method is not implemented in the concrete class, or fall back to the char API - either way it would not do what is expected.

          Working around those two problems was the motivation for a new API for CharTokenizer. My personal opinion is that inheritance is the wrong tool for changing behavior, so I used delegation (like a strategy), on the one hand to define a clear "new" API, and on the other to decouple the code changing the behavior of the tokenizer from the tokenizer itself. To me, inheritance is for extending a class, while delegation is for changing behavior, as in this particular problem.
          Decoupling the old from the new has several advantages over a reflection / inheritance based solution:
          1. If a user does not provide a delegation impl, he wants to use the old API.
          2. If a user does provide a delegation impl, he still has the ability to choose between char processing in 3.0 style or 3.1 style.
          3. No matter what is provided, a user has full flexibility to choose the combination of their choice - old char processing with the new int-based API (maybe a minor point, though).
          4. We can leave all tokenizer subclasses as they are and define new functions that implement their behavior in parallel. Those functions can be made final from the beginning, which prevents users from subclassing them. (All of the existing ones should be final in my opinion - like LowerCaseTokenizer, which should call Character.isLetter in isTokenCodePoint(int) directly instead of subclassing another function.) A sketch of this delegation idea follows below.

          As a user I would expect Lucene to revise design decisions made years ago when there is a need for it, as in this issue. It is easier to change behavior in user code by swapping to a new API than by digging into a workaround implementation of an old API that silently calls a new one.
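
          A minimal sketch of the delegation idea, assuming illustrative names (the actual TokenCharFunctions API in the patch may differ):

          // hypothetical strategy type in the spirit of the patch's TokenCharFunctions
          abstract class TokenCharFunction {
            /** true if the given codepoint belongs to a token */
            abstract boolean isTokenCodePoint(int codePoint);

            /** normalize (e.g. lowercase) the given codepoint; identity by default */
            int normalizeCodePoint(int codePoint) {
              return codePoint;
            }
          }

          // what LowerCaseTokenizer's behavior could look like as a final function:
          // no subclassing, Character.isLetter called directly in isTokenCodePoint(int)
          final class LowerCaseLetterFunction extends TokenCharFunction {
            @Override
            boolean isTokenCodePoint(int codePoint) {
              return Character.isLetter(codePoint);
            }

            @Override
            int normalizeCodePoint(int codePoint) {
              return Character.toLowerCase(codePoint);
            }
          }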

          Robert Muir added a comment -

          1. If a user calls super.isTokenChar(char) and the superclass has implemented the int method, the UOE will never be thrown and the code does not behave as "expected" from the user's perspective. This is what Robert explained above. We could solve this problem with reflection, which leads to the second problem.

          2. If a tokenizer like LowerCaseTokenizer only overrides normalize(char|int), it relies on the superclass implementation of isTokenChar. Yet if we solve problem 1, the user would be forced to override isTokenChar just to call super.isTokenChar; otherwise the reflection code would either raise an exception that the int method is not implemented in the concrete class, or fall back to the char API - either way it would not do what is expected.

          I do not think this is true; what I was trying to do was modify the design I proposed so that we did not need reflection at all, but I think this is impossible.

          In the design you propose, subclassing is impossible under the new API, which I am not sure I like either.

          #2 is no problem at all; instead, the reflection code to address #1 must be implemented with these conditions:

          • A is the class implementing method isTokenChar(int)
          • B is the class implementing method isTokenChar(char)
          • B is a subclass of A
          • A is not CharTokenizer
          Simon Willnauer added a comment -

          #2 is no problem at all; instead, the reflection code to address #1 must be implemented with these conditions:

          • A is the class implementing method isTokenChar(int)
          • B is the class implementing method isTokenChar(char)
          • B is a subclass of A
          • A is not CharTokenizer

          OK, here is a scenario:

          class MySmartDeseretTokenizer extends LetterTokenizer {

            public boolean isTokenChar(char c) {
              // we trust that Deseret high/low surrogates are never unpaired
              return super.isTokenChar(c) || isDeseretHighLowSurrogate(c);
            }

            public char normalize(char c) {
              if (isDeseretHighSurrogate(c))
                return c;
              if (isDeseretLowSurrogate(c))
                return lowerCaseDeseret('\ud801', c)[1];
              return Character.toLowerCase(c);
            }

            public int normalize(int c) {
              return Character.toLowerCase(c);
            }
          }

          If somebody has code similar to this, they might want to preserve compatibility because they have different versions of their app. Yet the old app only supports Deseret high surrogates, while the new one accepts all supplementary letter chars due to super.isTokenChar(int). This scenario would break our reflection solution, and users might be disappointed, as the new API is there to bring the Unicode support. I don't say this scenario exists, but it could be a valid one for a very special use case.

          I don't say my proposal is THE way to go, but I really don't want to use reflection - that would make things worse IMO.
          Let's find a solution that fits all scenarios.

          in the design you propose, subclassing is impossible under the new API, which I am not sure I like either.

          Hmm, that is not true. You can still subclass and pass your impl up to the superclass. I haven't implemented that yet, but it is definitely possible.

          Robert Muir added a comment -

          Simon, I don't think your example is a problem.

          I am proposing my original design, with no reflection, driven by Version only.

          There is only one exception where reflection is used: during the ctor, to determine whether:

          • you subclass a tokenizer that implements the int-based methods
          • you have only implemented the char-based methods
          • you request Version >= 3.1

          In this case, reflection is only used in the ctor to throw UOE!

          If someone wants to support Version 3.1 in their app, they simply implement the int-based methods.
          To support lower versions, they do nothing; they do not need to implement the char-based methods, and they get the backwards compat automatically, as long as they supply the correct Version. This is guaranteed by CharacterUtils.

          I am only proposing using reflection to enforce the throwing of UOE in the case that someone requests Version 3.1 but has not implemented the int-based methods.

          If they only want to support Version <= 3.0, this is fine; it will work with their char-based stuff automatically.

          I think it would be easiest if I modified your patch to illustrate this, so I'll do it in a few days.

          Simon Willnauer added a comment -

          This issue is blocked by: ...

          I give up...

          Uwe Schindler added a comment -

          I give up...

          Hehe

          Uwe Schindler added a comment -

          this linkage is less hard.

          Uwe Schindler added a comment - - edited

          There is only one exception where reflection is used: during the ctor, to determine whether:

          • you subclass a tokenizer that implements the int-based methods
          • you have only implemented the char-based methods
          • you request Version >= 3.1

          With LUCENE-2188, this is easy and not a performance problem. Just define two static final fields for the two char-based methods and check in the ctor whether this.getClass() overrides either char-based method. In that case, throw an exception. The result is cached for the class, so further instantiations of the same class will not use reflection anymore:

          private static final VirtualMethod<CharTokenizer> isTokenCharMethod =
              new VirtualMethod<CharTokenizer>(CharTokenizer.class, "isTokenChar", char.class);
          private static final VirtualMethod<CharTokenizer> normalizeMethod =
              new VirtualMethod<CharTokenizer>(CharTokenizer.class, "normalize", char.class);
          ...
          public CharTokenizer(...) {
            super(...);
            if (matchVersion.onOrAfter(Version.LUCENE_31) &&
                (isTokenCharMethod.isOverriddenAsOf(this.getClass()) || normalizeMethod.isOverriddenAsOf(this.getClass())))
              throw new IllegalArgumentException(
                  "For matchVersion >= LUCENE_31, CharTokenizer subclasses must not override isTokenChar(char) or normalize(char).");
          }
          
          Simon Willnauer added a comment -

          I updated the patch to make use of the nice reflection utils and ported all subclasses of CharTokenizer to the int-based API.
          Due to the addition of Version to the CharTokenizer ctors, this patch creates a lot of usage of deprecated API.
          Yet, I haven't changed all the usages of the deprecated ctors; that should be done in another issue IMO.

          Uwe Schindler added a comment -

          I have not looked into it in detail yet, but it looks correct. I am not sure about the overhead of passing each char through the proxy class. My idea would be to declare CharFunction as a private interface and let CharTokenizer implement it (invisible to the outside, so it can be removed in later versions). The ctor then passes "this" as the CharFunction for >= 3.1, and a new proxy instance of the interface in the deprecated case. That way, at least the new code path has no extra method calls.

          The VirtualMethod stuff looks OK; thanks for using it as suggested here!

          Simon Willnauer added a comment -

          Uwe, using an interface doesn't work, though, as I cannot reduce the public visibility in CharTokenizer. Yet, this patch tries to solve it with an abstract class.
          To be honest, I would rather say we duplicate the code and use a simple boolean switch in incrementToken. Not that nice, but definitely faster.

          What do you think?

          Uwe Schindler added a comment -

          +1, because this is very speed-sensitive.

          Simon Willnauer added a comment -

          This version "duplicates" the incrementToken method to prevent any unnecessary method call inside the loop.
          I also use the Character static methods directly where possible without the CharacterUtils indirection.
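
          A rough sketch of what such a codepoint-based loop looks like (illustrative, not the patch verbatim; Character.isLetter/toLowerCase stand in for isTokenChar(int)/normalize(int)):

          // collect the first run of token codepoints from buf[0..length) into term,
          // advancing by Character.charCount(cp) so surrogate pairs stay intact
          static int firstToken(char[] buf, int length, char[] term) {
            int termLen = 0;
            for (int i = 0; i < length; ) {
              int cp = Character.codePointAt(buf, i, length);
              i += Character.charCount(cp);                 // 1 for BMP, 2 for supplementary
              if (Character.isLetter(cp)) {                 // stand-in for isTokenChar(int)
                int norm = Character.toLowerCase(cp);       // stand-in for normalize(int)
                termLen += Character.toChars(norm, term, termLen);
              } else if (termLen > 0) {
                break;                                      // token ended
              }
            }
            return termLen;
          }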

          Simon Willnauer added a comment -

          Short update: I found a bug in the latest version, which was untested. I will update soon with a speed comparison between the current version and the version using the proxy class.

          Simon Willnauer added a comment -

          Added a CHANGES.txt entry and fixed two supplementary-character related bugs in the new version of incrementToken(). Test cases added for the bugs.

          Simon Willnauer added a comment - - edited

          I ran the following benchmark .alg file against the latest patch (specialized old and new methods), the patch with the proxy methods, and the old 3.0 code. The outcome shows that the specialized code is about ~8% faster than the proxy-class based code, so I would rather keep the specialized code, as this class is performance-sensitive.

          .alg file

          analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
          content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
          content.source.forever=false
          { "Rounds" { "ReadTokens" ReadTokens > : *  NewRound ResetSystemErase} : 10
          RepAll
          

          10 Rounds with the latest patch

               [java] ------------> Report All (11 out of 12)
               [java] Operation          round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
               [java] Rounds_10              0        1            0         0.00       14.83     5,049,432     66,453,504
               [java] ReadTokens_Exhaust     0        1            0         0.00        2.07    34,558,000     55,705,600
               [java] ReadTokens_Exhaust     1        1            0         0.00        1.40    41,865,312     60,555,264
               [java] ReadTokens_Exhaust     2        1            0         0.00        1.22    34,393,904     63,176,704
               [java] ReadTokens_Exhaust     3        1            0         0.00        1.24    15,440,624     64,487,424
               [java] ReadTokens_Exhaust     4        1            0         0.00        1.22     7,540,512     65,601,536
               [java] ReadTokens_Exhaust     5        1            0         0.00        1.21    50,174,760     67,239,936
               [java] ReadTokens_Exhaust     6        1            0         0.00        1.19    22,202,768     67,174,400
               [java] ReadTokens_Exhaust     7        1            0         0.00        1.19    20,591,672     68,812,800
               [java] ReadTokens_Exhaust     8        1            0         0.00        1.18    63,749,984     69,009,408
               [java] ReadTokens_Exhaust     9        1            0         0.00        1.19    22,331,600     68,943,872
          

          10 rounds with Proxy Class

               [java] ------------> Report All (11 out of 12)
               [java] Operation          round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
               [java] Rounds_10              0        1            0         0.00       16.33     5,021,144     67,436,544
               [java] ReadTokens_Exhaust     0        1            0         0.00        2.34    44,649,496     59,244,544
               [java] ReadTokens_Exhaust     1        1            0         0.00        1.53    36,681,952     61,472,768
               [java] ReadTokens_Exhaust     2        1            0         0.00        1.37    13,863,688     64,094,208
               [java] ReadTokens_Exhaust     3        1            0         0.00        1.34    50,247,864     65,470,464
               [java] ReadTokens_Exhaust     4        1            0         0.00        1.36    14,922,888     66,322,432
               [java] ReadTokens_Exhaust     5        1            0         0.00        1.36     5,718,296     67,371,008
               [java] ReadTokens_Exhaust     6        1            0         0.00        1.32    54,583,776     67,502,080
               [java] ReadTokens_Exhaust     7        1            0         0.00        1.33    35,739,800     68,943,872
               [java] ReadTokens_Exhaust     8        1            0         0.00        1.32    24,985,688     69,861,376
               [java] ReadTokens_Exhaust     9        1            0         0.00        1.29    64,138,112     69,730,304
          

          10 rounds with current trunk

               [java] ------------> Report All (11 out of 12)
               [java] Operation          round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
               [java] Rounds_10              0        1            0         0.00       15.19     5,040,928     66,256,896
               [java] ReadTokens_Exhaust     0        1            0         0.00        2.15    39,548,440     55,443,456
               [java] ReadTokens_Exhaust     1        1            0         0.00        1.43    28,088,544     60,096,512
               [java] ReadTokens_Exhaust     2        1            0         0.00        1.27    16,004,088     61,800,448
               [java] ReadTokens_Exhaust     3        1            0         0.00        1.25    51,034,016     63,045,632
               [java] ReadTokens_Exhaust     4        1            0         0.00        1.24    23,371,056     63,504,384
               [java] ReadTokens_Exhaust     5        1            0         0.00        1.24    12,964,368     65,208,320
               [java] ReadTokens_Exhaust     6        1            0         0.00        1.25     6,598,128     65,601,536
               [java] ReadTokens_Exhaust     7        1            0         0.00        1.23    50,932,464     67,239,936
               [java] ReadTokens_Exhaust     8        1            0         0.00        1.24    20,433,136     67,305,472
               [java] ReadTokens_Exhaust     9        1            0         0.00        1.23    63,638,552     68,812,800
          
          
          Uwe Schindler added a comment -

          I'll commit this, if nobody objects in a day.

          Robert Muir added a comment -

          Hello, a few very minor nitpicks:

          • it seems the javadoc for isTokenChar(char)/isTokenChar(int) is backwards: the int-based API refers to 'character' but the char-based API refers to 'codepoint' (which is wrong). Perhaps it would be best to refer to all char-based methods as working on 'UTF-16 code units' and the int-based APIs as working on 'codepoints'.
          • we might want to insert a note/warning on the char-based methods, consistent with the JDK javadocs, "Note this method cannot handle supplementary characters...", for example like: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#getType%28char%29 I think it is important to also include the link to the JDK explanation of what a supplementary character is: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#supplementary
          • if possible we might want to include some class-level wording on how the whole thing works: if you implement the int-based API, you can use your class with all Lucene Versions, and the bw layer will make it work correctly with old indexes. If you only stay with the char-based API, then you can only use your CharTokenizer with Version <= 3.0. We can also mention that it is unnecessary to implement both - only the int-based API!

          Sorry for being picky about the javadocs; I am trying to think of ways to prevent confusion on the user list, as I anticipate a bunch of people will probably just use Version.LUCENE_CURRENT and have no idea what a supplementary character is.

          Great work.

          Uwe Schindler added a comment -

          we might want to insert a note/warning on the char-based methods, consistent with the JDK javadocs, "Note this method cannot handle supplementary characters...", for example like: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#getType%28char%29 I think it is important to also include the link to the JDK explanation of what a supplementary character is: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#supplementary

          For that a link using javadoc

          {@link Character#supplementary}

          would be good. I will fix this here, as I already have the patch applied and will commit it later.

          if possible we might want to include some class-level wording on how the whole thing works: if you implement the int-based API, you can use your class with all Lucene Versions, and the bw layer will make it work correctly with old indexes. If you only stay with the char-based API, then you can only use your CharTokenizer with Version <= 3.0. We can also mention that it is unnecessary to implement both - only the int-based API!

          ++++++1. The old TokenStream API had a check for these problems, right?

          Simon Willnauer added a comment -

          For that a link using javadoc {@link Character#supplementary} would be good. I will fix this here, as I already have the patch applied and will commit it later.

          Uwe, I will take care of it and upload another patch. Thanks for being picky, Rob!

          Uwe Schindler added a comment -

          Here are other javadoc fixes.

          Simon Willnauer added a comment -

          Robert / Uwe,

          I worked on the documentation at the method and class level. I would appreciate a wording review.

          thanks

          Robert Muir added a comment -

          Hello Simon, thanks for adding this additional wording.

          Uwe Schindler added a comment -

          OK, will commit that tomorrow. Thanks Simon & Robert!

          Uwe Schindler added a comment -

          Committed revision: 904401
          Thanks Simon!


            People

            • Assignee: Uwe Schindler
            • Reporter: Simon Willnauer
            • Votes: 1
            • Watchers: 0
