Solr - SOLR-1423

Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      Because of some backwards-compatibility problems (LUCENE-1906) we changed the CharStream/CharFilter API a little bit. Tokenizer now only has an input field of type java.io.Reader (as before the CharStream code). To correct offsets, it is now necessary to call the Tokenizer.correctOffset(int) method, which delegates to the CharStream (if the input is a subclass of CharStream) and otherwise returns the offset uncorrected. Normally it is enough to change all occurrences of input.correctOffset() to this.correctOffset() in Tokenizers. It should also be checked whether custom Tokenizers in Solr correct their offsets.
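
      As a minimal illustration of the change, consider this hypothetical Tokenizer (the class name and the trivial single-token logic are illustrative, not Solr code; only the offset-correction idiom is the point):

        import java.io.IOException;
        import java.io.Reader;
        import org.apache.lucene.analysis.Tokenizer;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
        import org.apache.lucene.analysis.tokenattributes.TermAttribute;

        // Hypothetical single-token Tokenizer showing the new offset idiom.
        public final class SingleTokenTokenizer extends Tokenizer {
          private final TermAttribute termAtt = (TermAttribute) addAttribute(TermAttribute.class);
          private final OffsetAttribute offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
          private boolean done = false;

          public SingleTokenTokenizer(Reader input) {
            super(input); // input is now a plain java.io.Reader, not a CharStream
          }

          public boolean incrementToken() throws IOException {
            if (done) return false;
            done = true;
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) sb.append((char) c);
            termAtt.setTermBuffer(sb.toString());
            // Old code called input.correctOffset(...). Now call the protected
            // Tokenizer.correctOffset(int), which delegates to the CharStream
            // if the input is one, else returns the offset unchanged.
            offsetAtt.setOffset(correctOffset(0), correctOffset(sb.length()));
            return true;
          }
        }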

      1. SOLR-1423.patch
        8 kB
        Koji Sekiguchi
      2. SOLR-1423.patch
        8 kB
        Uwe Schindler
      3. SOLR-1423.patch
        7 kB
        Koji Sekiguchi
      4. SOLR-1423-FieldType.patch
        0.6 kB
        Uwe Schindler
      5. SOLR-1423-fix-empty-tokens.patch
        14 kB
        Uwe Schindler
      6. SOLR-1423-fix-empty-tokens.patch
        13 kB
        Uwe Schindler
      7. SOLR-1423-with-empty-tokens.patch
        11 kB
        Uwe Schindler

          Activity

          Koji Sekiguchi added a comment -

          I'd like to check it before the 1.4 release. I'll look into it once RC4 is checked into Solr.

          Koji Sekiguchi added a comment -

          I thought I could call tokenizer.correctOffset() in the newToken() method, but I couldn't because the method is protected. In this patch, I converted the anonymous Tokenizer class to PatternTokenizer, and PatternTokenizer has the following:

          +    public int correct( int currentOffset ){
          +      return correctOffset( currentOffset );
          +    }
          
          Uwe Schindler added a comment -

          I have seen this PatternTokenizer, too. The method is protected, as correctOffset should only be called from inside the Tokenizer (e.g. in incrementToken), never from the outside. Why doesn't PatternTokenizer have the methods newToken and so on in its own class (by the way: this should be updated to the new TokenStream API)?

          Uwe Schindler added a comment -

          I searched for setOffset() in the Solr source code and found one additional occurrence of it without offset correction, in FieldType.java. This patch fixes that.

          I will provide a completely streaming PatternTokenizer, not using ArrayLists, soon. It will move the while(matcher.find()) loop into the incrementToken() enumeration and will also use the new TokenStream API, as sketched below.
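
          A rough sketch of that streaming shape (illustrative only, not the attached patch):

            import java.io.IOException;
            import java.io.Reader;
            import java.util.regex.Matcher;
            import java.util.regex.Pattern;
            import org.apache.lucene.analysis.Tokenizer;
            import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
            import org.apache.lucene.analysis.tokenattributes.TermAttribute;

            // Sketch: read the input once, then advance the matcher by one
            // match per incrementToken() call instead of pre-collecting all
            // tokens in an ArrayList.
            public final class StreamingPatternTokenizer extends Tokenizer {
              private final TermAttribute termAtt = (TermAttribute) addAttribute(TermAttribute.class);
              private final OffsetAttribute offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
              private final Matcher matcher;
              private final String str;

              public StreamingPatternTokenizer(Reader input, Pattern pattern) throws IOException {
                super(input);
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[8192];
                int n;
                while ((n = input.read(buf)) != -1) sb.append(buf, 0, n);
                str = sb.toString();
                matcher = pattern.matcher(str);
              }

              public boolean incrementToken() {
                // one match per call replaces the up-front while loop
                if (!matcher.find()) return false;
                termAtt.setTermBuffer(str, matcher.start(), matcher.end() - matcher.start());
                offsetAtt.setOffset(correctOffset(matcher.start()), correctOffset(matcher.end()));
                return true;
              }
            }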

          Uwe Schindler added a comment -

          This is a complete, more efficient rewrite of the whole Tokenizer (I would like to put this into Lucene contrib, too!) using the new TokenStream API.

          When going through the code, I realized the following: this Tokenizer can return empty tokens; it only filters empty tokens in split() mode. Is this expected? If empty tokens should be omitted, the if (matcher.find()) should be replaced by a while (matcher.find()) loop that continues on zero-length matches, as sketched below. The logic behind the strange "omit empty token at the end" will become very simple after this change.
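
          In code, the suggested change inside incrementToken() is roughly (reusing the fields from the sketch above):

            // Loop instead of a single find(), skipping zero-length matches
            // so that no empty tokens are emitted.
            while (matcher.find()) {
              if (matcher.end() == matcher.start()) continue; // skip empty token
              termAtt.setTermBuffer(str, matcher.start(), matcher.end() - matcher.start());
              offsetAtt.setOffset(correctOffset(matcher.start()), correctOffset(matcher.end()));
              return true;
            }
            return false;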

          This patch removes the split()/group() methods from the factory, as they are not needed anymore. If this is a backwards break, replace them with unused dummies (e.g. initialize a Tokenizer and return the token's TermText).

          In my opinion, one should never index empty tokens...

          A second thing: Lucene has a new BaseTokenStreamTestCase class for checking tokens without Token instances (which would no longer work when Lucene 3.0 switches to Attributes only). Maybe you should update these tests and use assertAnalyzesTo from the new base class instead.

          Koji Sekiguchi added a comment -

          The patch is Uwe's one, with the split()/group() methods replaced.

          Why doesn't PatternTokenizer have the methods newToken and so on in its own class

          Yeah, I'd realized it immediately after posting the patch, but I was going to be out.

          And thank you for adapting it for the new TokenStream API.

          I searched for setOffset() in the Solr source code and found one additional occurrence of it without offset correction, in FieldType.java. This patch fixes that.

          Good catch, Uwe! I slipped over it.

          I think the empty tokens are a bug and should be omitted in this patch.

          A second thing: Lucene has a new BaseTokenStreamTestCase class for checking tokens without Token instances (which would no longer work when Lucene 3.0 switches to Attributes only). Maybe you should update these tests and use assertAnalyzesTo from the new base class instead.

          Very nice! Can you open a separate ticket?

          Uwe Schindler added a comment -

          I think the empty tokens are a bug and should be omitted in this patch.

          The Javadocs say that it works like String.split(), which returns empty tokens but strips empty tokens at the end of the string. This is the functionality Solr provided before and provides with this patch.
          The code would get simpler if the Tokenizer generally stripped empty tokens, but that is a backwards break. I would tend to just commit and then open another issue.
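
          For reference, the String.split() behaviour mentioned above:

            // split() keeps interior empty tokens but strips trailing empty ones:
            String[] parts = ",a,,b,,".split(",");
            // parts == { "", "a", "", "b" } -- the trailing empty strings are removed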

          Very nice! Can you open a separate ticket?

          Will open one about Lucene's BaseTokenStreamTestCase.

          Uwe Schindler added a comment -

          I forgot: I would deprecate the unneeded methods in the factory!

          Uwe Schindler added a comment -

          Some refactoring (I moved the PatternTokenizer to its own class, like PatternReplaceFilter). This patch is functionally identical to current trunk, but more efficient; it uses the new TokenStream API and implements end() (which sets the offset to the end of the string).
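
          A sketch of such an end() implementation (field names as in the earlier sketches):

            // end() is called once after the last token and reports the final
            // offset, i.e. the (corrected) end of the input string.
            public void end() {
              int finalOffset = correctOffset(str.length());
              offsetAtt.setOffset(finalOffset, finalOffset);
            }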

          I will soon post a patch that removes empty tokens.

          Uwe Schindler added a comment -

          This is a patch that fixes the empty tokens:
          This Tokenizer is not backwards compatible, as it only returns non-zero-length tokens. Maybe we should have a switch somewhere to change this behaviour. It is currently for discussion only.

          Yonik Seeley added a comment -

          This Tokenizer is not backwards compatible, as it only returns non-zero-length tokens. Maybe we should have a switch somewhere to change this behaviour.

          Passing through only non-zero-length tokens was probably always the intent, and the old behavior is a bug and isn't useful, so I don't think we need a switch.

          Uwe Schindler added a comment -

          Then you could use SOLR-1423-fix-empty-tokens.patch; it should work. The comparison with String.split() in one of the tests was commented out because it does not work with the tokenizer (as empty tokens are not returned).

          I only wanted to check that the offsets are calculated correctly. The second test does this, but I want to be sure that they are correct for both group=-1 and group>=0.

          Uwe Schindler added a comment -

          Attached a new patch with the empty token fix.

          It has an additional test for the offsets when group!=-1. It is also more optimized, as it uses setTermBuffer(string, offset, len) to copy the chars into the term buffer, which is faster than allocating a new String with substring().
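
          The optimization in sketch form (names as in the earlier sketches):

            // Slower: substring() allocates an intermediate String that is then
            // copied into the term buffer anyway:
            //   termAtt.setTermBuffer(str.substring(matcher.start(), matcher.end()));
            // Faster: copy the matched region directly into the term buffer:
            termAtt.setTermBuffer(str, matcher.start(), matcher.end() - matcher.start());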

          Koji Sekiguchi added a comment -

          The patch looks good! Will commit shortly.

          Koji Sekiguchi added a comment -

          Committed revision 816502. Thanks, Uwe!

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


            People

            • Assignee: Koji Sekiguchi
            • Reporter: Uwe Schindler
            • Votes: 0
            • Watchers: 1
