Lucene - Core

LUCENE-1906: Backwards problems with CharStream and Tokenizers with custom reset(Reader) method

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When reviewing the new CharStream code added to Tokenizers, I found a
      serious backwards-compatibility problem with Tokenizers that do not
      override reset(CharStream).

      The problem is that, e.g., CharTokenizer only overrides reset(Reader):

        public void reset(Reader input) throws IOException {
          super.reset(input);
          bufferIndex = 0;
          offset = 0;
          dataLen = 0;
        }
      

      If you reset such a Tokenizer with a CharStream (rather than a plain
      Reader), this method will never be called, breaking the whole Tokenizer.

      As CharStream extends Reader, I propose to remove this reset(CharStream)
      method and simply do an instanceof check to detect whether the supplied
      Reader is not a CharStream, and wrap it if so. We could also remove the
      extra ctor (because most Tokenizers have no support for passing
      CharStreams). If the ctor also checks with instanceof and wraps as
      needed, the code is backwards compatible and we do not need to add
      additional ctors in subclasses.
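
      A minimal sketch of that proposal (hedged; it assumes the existing
      CharReader.get() factory, which already wraps a plain Reader into a
      CharStream):

        // In Tokenizer (sketch): a single reset(Reader) handles both cases,
        // because CharStream extends Reader. Plain Readers get wrapped so
        // offset correction keeps working; CharStreams pass through as-is.
        public void reset(Reader input) throws IOException {
          this.input = (input instanceof CharStream)
              ? (CharStream) input
              : CharReader.get(input);
        }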

      As this instanceof check is already done in CharReader.get(), why not
      remove ctor(CharStream) and reset(CharStream) completely?

      Any thoughts?

      I would like to fix this somehow before RC4; I'm sorry.

      1. LUCENE-1906.patch
        2 kB
        Uwe Schindler
      2. backwards-break.patch
        4 kB
        Uwe Schindler
      3. LUCENE-1906.patch
        8 kB
        Uwe Schindler
      4. LUCENE-1906_contrib.patch
        6 kB
        Robert Muir
      5. LUCENE-1906.patch
        16 kB
        Uwe Schindler
      6. LUCENE-1906-bw.patch
        6 kB
        Uwe Schindler
      7. LUCENE-1906.patch
        22 kB
        Uwe Schindler


          Activity

          Uwe Schindler added a comment -

          I will now check whether the change of the "input" member variable leads to backwards breaks (it was changed from Reader to CharStream)...

          Yonik Seeley added a comment -

          +1, this looks like the best fix.

          Michael McCandless added a comment -

          +1, good catch Uwe!

          Mark Miller added a comment -

          Ready Uwe?

          Uwe Schindler added a comment -

          I will now check whether the change of the "input" member variable leads to backwards breaks (it was changed from Reader to CharStream)...

          We also have a not-in-CHANGES.txt backwards break: we changed Reader to CharStream, and when committing that change, the backwards branch was also modified to use CharStream. I reverted this change (to get the state of a legacy Tokenizer) and wrote a test that calls input.read() inside the Tokenizer. The test compiled and ran correctly in the backwards branch.

          Against trunk, the test-tag target failed:

              [junit] Testcase: testChangeToCharStream29(org.apache.lucene.analysis.TestTokenizer):       Caused an ERROR
              [junit] input
              [junit] java.lang.NoSuchFieldError: input
              [junit]     at org.apache.lucene.analysis.TestTokenizer$1.next(TestTokenizer.java:43)
              [junit]     at org.apache.lucene.analysis.TestTokenizer.testChangeToCharStream29(TestTokenizer.java:58)
              [junit]
              [junit]
              [junit] Test org.apache.lucene.analysis.TestTokenizer FAILED
          

          So somebody cannot use their old Tokenizers without recompiling (Java resolves field references by name and declared type, so it cannot transparently follow the type change). After recompiling the classes, it works.

          If we want to do this, we should clearly state it in CHANGES under the backwards-compatibility breaks.
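
          For illustration, a hypothetical legacy subclass shows why the error
          occurs: compiled field references in Java bytecode carry the field's
          declared type, so changing "input" from Reader to CharStream breaks
          binary compatibility even though unchanged sources still compile:

              import java.io.IOException;
              import org.apache.lucene.analysis.Token;
              import org.apache.lucene.analysis.Tokenizer;

              // Compiled against an older Lucene, where Tokenizer declared:
              //   protected Reader input;
              public class LegacyTokenizer extends Tokenizer {
                public Token next(final Token reusableToken) throws IOException {
                  // The class file references Tokenizer.input with descriptor
                  // Ljava/io/Reader; -- run against a jar that declares the
                  // field as CharStream, field resolution throws
                  //   java.lang.NoSuchFieldError: input
                  int c = input.read();
                  return null; // tokenizing logic omitted
                }
              }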

          Uwe Schindler added a comment -

          Here is the patch for the backwards branch that fails. It reverts the change from Reader to CharStream and adds a dummy Tokenizer in the test that compiles, but does not run against trunk.

          Uwe Schindler added a comment -

          Sorry, wrong patch; this one is correct. The other one was an even harder test, assigning a value to the field.

          Uwe Schindler added a comment - edited

          One possibility to prevent this break would be the following:

          • revert the CharStream to a Reader in Tokenizer (restore the old class).
          • add a method correctOffset(int) to Tokenizer that does:
            protected final int correctOffset(int offset) {
              if (input instanceof CharStream) return ((CharStream) input).correctOffset(offset);
              return offset;
            }
            
          • Change the usage of correctOffset in all Tokenizers in core/contrib (only a few are affected)
          • revert the backwards-branch changes

          In this case, the input of a Tokenizer always stays a java.io.Reader, and the offset correction defaults to a no-op. If a custom Reader extends CharStream, the offset correction is used automatically. This is 100% backwards compatible (only some old Tokenizers that do not call correctOffset() before passing offsets to Token/OffsetAttribute would produce invalid offsets if used with CharStreams instead of Readers).
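
          A short usage sketch of the two paths (hypothetical example, using
          the existing WhitespaceTokenizer and CharReader classes):

              // Plain Reader: the instanceof check fails, correctOffset() is
              // a no-op, and the legacy behavior is fully preserved.
              Tokenizer tok = new WhitespaceTokenizer(new StringReader("some text"));

              // CharStream input: the check succeeds and offsets are delegated
              // to CharStream.correctOffset(), e.g. to undo a CharFilter.
              tok.reset(CharReader.get(new StringReader("some text")));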

          Mark Miller added a comment -

          What about using an introspection cache again? Kind of nasty depending on how soon we could get rid of it.

          Personally, I think I'd prefer a break to a solution that doesn't solve every case.

          Robert Muir added a comment -

          (only some old Tokenizers that do not call correctOffset() before passing offsets to Token/OffsetAttribute would produce invalid offsets if used with CharStreams instead of Readers)

          But don't you have this problem already? I like Uwe's approach.

          Uwe Schindler added a comment -

          What about using an introspection cache again? Kind of nasty depending on how soon we could get rid of it.

          A cache for what? I do not understand. The problem is the change of the member field.

          but don't you have this problem already? I like Uwe's approach.

          Correct, this is always the problem with this change. The offsets are wrong if a legacy Tokenizer does not correct the offset (either through Tokenizer.correctOffset() or CharStream.correctOffset()).

          Michael McCandless added a comment -

          We also have a not-in-CHANGES.txt backwards break.

          Sigh, I forgot to also add an entry to "Changes in back compat policy"
          when we made this change (in LUCENE-1466).

          We had decided at the time to make an exception to back-compat, i.e.
          allow Reader to become CharStream, but we can definitely revisit
          that...

          One possibility to prevent this break would be the following

          Adding an instanceof check in every correctOffset call makes me
          nervous – what's the performance cost?

          Robert Muir added a comment -

          Correct, this is always the problem with this change. The offsets are wrong if a legacy Tokenizer does not correct the offset (either through Tokenizer.correctOffset() or CharStream.correctOffset()).

          Right, so your proposed fix would correct this issue (the back-compat break) without introducing any new one.

          If you want to use CharFilter, then your tokenizer must call correctOffset() so the offsets will be correct.
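
          A sketch of why that call matters, using the existing
          MappingCharFilter (a hedged example against the trunk API; the
          offsets below are for this particular input):

              NormalizeCharMap map = new NormalizeCharMap();
              map.add("&amp;", "&"); // five input chars collapse to one
              CharStream cs = new MappingCharFilter(map,
                  CharReader.get(new StringReader("a &amp; b")));
              // A tokenizer reads "a & b" from cs; the token "b" starts at
              // filtered offset 4, but the correct offset into the original
              // text is 8: cs.correctOffset(4) == 8. A tokenizer that skips
              // correctOffset() reports 4, and highlighting goes wrong.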

          Michael McCandless added a comment -

          Correct, this is always the problem with this change.

          Right, we should separately fix this (so that all of our core/contrib tokenizers invoke correctOffset).

          Mark Miller added a comment -

          A cache for what? I do not understand. The problem is the change of the member field.

          Sorry - I meant to say reflection: check if the instance just overrides reset(Reader) and, if so, call it with the CharReader from reset(Reader) (and cache the reflection results). Perhaps it gets more complicated, though - I haven't thought it through, just throwing it out. I've glossed over the code involved, but I'm still at a pretty high level with it, as I trust there are enough people who already know what they are doing in this area. I was never the kid in a group racing to have the math answer first.

          Correct, this is always the problem with this change. The offsets are wrong if a legacy Tokenizer does not correct the offset (either through Tokenizer.correctOffset() or CharStream.correctOffset()).

          Okay - I took it as: this is 100% backwards compatible, except for this exception where it is not. But if that's always been the case, it is back compat, and it sounds fine to me - I just got caught up in the wording.

          So if it's good, let's ship it! It will take me a bit of time to run all the tests, package up, and transfer over again (man, does all that Maven stuff make the transfer take forever - too many little files).

          Uwe Schindler added a comment -

          Here is the patch for core; contrib is unchanged.

          In principle, just replace "input.correctOffset(" with "correctOffset(" everywhere. All core tests pass. The backwards branch must simply revert the commit of LUCENE-1466.
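
          The mechanical change at each call site looks like this (hypothetical
          tokenizer internals):

              // Before: required the input field to be typed as CharStream.
              offsetAtt.setOffset(input.correctOffset(start), input.correctOffset(end));

              // After: uses the protected Tokenizer helper, so input can stay a Reader.
              offsetAtt.setOffset(correctOffset(start), correctOffset(end));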

          Michael McCandless added a comment -

          I'm still nervous about inserting an instanceof check plus a cast into every correctOffset call (which is called twice per token).

          For external code that accesses Tokenizer.input, a recompile is all that's required (which 2.9 already requires). Subclasses that directly assign to input (if any) really should be calling reset instead. I think breaking back compat here is OK?

          Uwe Schindler added a comment -

          "instanceof" is one of the operators directly implemented in the Java Bytecode as a separate instruction. It is fast and used everywhere. See various articles about it.

          Robert Muir added a comment -

          Contrib changes.

          Mark Miller added a comment -

          I think breaking back compat here is OK?

          I won't comment from a tech perspective, but in regard to this, that was my initial thought as well - if it's just a recompile, we are already saying 2.9 needs a recompile anyway, or else you might blow up - it is essentially required as it is.

          Yonik Seeley added a comment -

          "instanceof" is one of the operators directly implemented in the Java Bytecode as a separate instruction.

          My computer doesn't run bytecode ;-P
          Yes, it's relatively fast, but it's per-token too.

          a recompile is all that's required (which 2.9 already requires)

          Hmmm, I had missed that 2.9 required a recompile. In that case it doesn't seem like there is any additional back-compat breakage, and thus the correct fix would be Uwe's first patch?

          Mark Miller added a comment -

          Hmmm, I had missed that 2.9 required a recompile.

          Because it was never officially said one way or another, I was pretty loose about it:

          We recommend that you recompile your application with Lucene 2.9
          rather than attempting to "drop" it in. This will alert you to any
          issues you may have to fix if you are affected by one of the backward
          compatibility breaks. As always, it's a really good idea to thoroughly
          read CHANGES.txt before upgrading. Also, remember that this is a
          release candidate, and not the final Lucene 2.9 release.

          I should probably change it to required. I am saying: recompile or be
          prepared for a random break. That's got to be a euphemism for "recompile required".

          Uwe Schindler added a comment - edited

          Yes, it's relatively fast, but it's per-token too.

          It is once per token. But then you do not need to wrap the input Reader in a CharReader if you do not want to use CharFilters. If you wrap every Reader in a CharReader, you have a larger overhead (one additional method call per char read, if you tokenize using Reader.read()!).

          Hmmm, I had missed that 2.9 required a recompile. In that case it doesn't seem like there is any additional back compat breakage and thus the correct fix would be Uwe's first patch?

          A recompile is only needed in rare cases (if you override Scorers and so on). If you do not do anything very special with Lucene, it works without recompiling. With this break, e.g. external language Tokenizer packages (as Michael Busch calls them) shipped without source code would not work. This example is always brought up by Michael.

          If we accept the break, the first patch is all we need (and we should move the CHANGES entry up to the bw-breaks section).

          Mark Miller added a comment -

          With this break, e.g. external language Tokenizer packages (as Michael Busch calls them) shipped without source code would not work. This example is always brought up by Michael.

          Excellent point. Hadn't seen it before or didn't remember it.

          Michael McCandless added a comment -

          A recompile is only needed in rare cases (if you override Scorers and so on).

          Or implement Searchable (an interface that we've added methods to), or implement Weight (an interface that we changed to an abstract class), or your own MergePolicy (an IndexWriter instance is now required by the ctor). I agree these are way-expert things.

          With this break, e.g. external language Tokenizer packages (as Michael Busch calls them) shipped without source code would not work. This example is always brought up by Michael.

          Excellent point. Hadn't seen it before or didn't remember it.

          OK I agree, this does make me nervous, too.

          OK, you've convinced me, Uwe: I think we should in fact restore input to be a Reader, not a CharStream. I think the potential performance hit is the lesser evil here.

          Maybe for 3.0 we can declare that this will become a CharStream?

          Uwe Schindler added a comment -

          Maybe for 3.0 we can declare that this will become a CharStream?

          I think the instanceof check is less evil than:

          Yes, it's relatively fast, but it's per-token too.

          It is once per token. But then you do not need to wrap the input Reader in a CharReader if you do not want to use CharFilters. If you wrap every Reader in a CharReader, you have a larger overhead (one additional method call per char read, if you tokenize using Reader.read()!).

          I think we should wait a while and sleep a night on it. Let's move RC4 to tomorrow morning.

          We have both possibilities, let's collect arguments +/-

          Excellent point. Hadn't seen it before or didn't remember it.

          He brought this up several times in the TokenStream discussion. This is why we made this very fancy backwards layer that works with many special usages of TokenStreams, like subclassing Token and so on (see the extra BW test). Hard stuff! And all because of this argument.

          Uwe Schindler added a comment -

          Here are the updated patches for trunk (some missing occurrences of CharReader wrapping and so on) and the backwards branch (reverting the changes).

          The changes also reveal one possible problem with current trunk: StandardTokenizer did not call super.reset(stream), a mistake user-provided tokenizers could also make. Changing public/protected member fields is a bad idea (and one reason not to have non-final public fields, and to use get/set methods instead).
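
          A simplified sketch of that bug pattern (hypothetical; the scanner is
          the generated jflex lexer inside StandardTokenizer):

              // Broken: swaps the scanner's reader but leaves Tokenizer.input stale.
              public void reset(Reader reader) throws IOException {
                scanner.yyreset(reader);
              }

              // Fixed: delegate first, so the inherited input field stays in sync.
              public void reset(Reader reader) throws IOException {
                super.reset(reader);
                scanner.yyreset(reader);
              }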

          All tests pass. Any other thoughts about this? Yonik? Mark? Robert? Mike?

          Robert Muir added a comment -

          Uwe, I like your patch.

          What was that StandardTokenizer setInput (which did not call super.reset) supposed to do?

          Uwe Schindler added a comment -

          It was never used and seems to be a relic from a time when the jflex file created the tokenizer directly (?). It is package-private and used nowhere, so there is no problem with deleting it. At least it should call reset(Reader) if we leave it in.

          Mark Miller added a comment -

          I say we go with it - 'instanceof' will have no practical effect as far as I can tell from any micro-benchmarks. That, plus what I've read, has me not too worried.

          My vote is to go with what you have. We might as well wait till the morning at this point no matter what, really - and that will give anyone else an opportunity to chime in.

          Michael McCandless added a comment -

          Patch looks good Uwe!

          Koji Sekiguchi added a comment -

          +1, patch looks good, thanks Uwe!

          Uwe Schindler added a comment -

          Some tweaks in the JavaDocs. I committed this patch.

          Uwe Schindler added a comment -

          Committed revision: 813671

          There may be some changes needed in Solr (correctOffset in Tokenizer / CharStream->Reader), should I open an issue?

          Koji Sekiguchi added a comment -

          There may be some changes needed in Solr (correctOffset in Tokenizer / CharStream->Reader), should I open an issue?

          Yes, please.

          Uwe Schindler added a comment -

          It seems that Solr compiles with the new JAR files (there is only one compile error, unrelated to this issue, in PatternTokenizerFactory). Tests seem to work; not tested thoroughly.

          There seem to be no custom analyzers using correctOffset at all in Solr.

          Uwe Schindler added a comment -

          I created an issue (SOLR-1423). Maybe there is nothing to do; I am not very familiar with Solr. But it should be checked that all Tokenizers correct their offsets (which some of them do not seem to do).


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:
