Lucene - Core
  1. Lucene - Core
  2. LUCENE-5388

Eliminate construction over readers for Tokenizer

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: core/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In the modern world, Tokenizers are intended to be reusable, with input supplied via #setReader. The constructors that take Reader are a vestige. Worse yet, they invite people to make mistakes in handling the reader that tangle them up with the state machine in Tokenizer. The sensible thing is to eliminate these ctors, and force setReader usage.

      1. LUCENE-5388.patch
        646 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        +1, its really silly its this way. I guess its the right thing to do this for 5.0 only: i wish we had done it for 4.0, but it is what it is.

        Should be a rather large and noisy change unfortunately. I can help, let me know.

        Show
        Robert Muir added a comment - +1, its really silly its this way. I guess its the right thing to do this for 5.0 only: i wish we had done it for 4.0, but it is what it is. Should be a rather large and noisy change unfortunately. I can help, let me know.
        Hide
        Michael McCandless added a comment -

        +1, it's weird that the ctor takes a Reader and we also have a setReader. This is a relic from the pre-reuse days...

        Show
        Michael McCandless added a comment - +1, it's weird that the ctor takes a Reader and we also have a setReader. This is a relic from the pre-reuse days...
        Hide
        Uwe Schindler added a comment -

        +1, but delay this until 5.0. Because there are many Tokenizers outside we should keep backwards compatibility.

        Show
        Uwe Schindler added a comment - +1, but delay this until 5.0. Because there are many Tokenizers outside we should keep backwards compatibility.
        Hide
        Benson Margulies added a comment - - edited

        Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my recent doc to a 4.x branch? A feature branch where this goes to be merged in when the time is ripe?

        Show
        Benson Margulies added a comment - - edited Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my recent doc to a 4.x branch? A feature branch where this goes to be merged in when the time is ripe?
        Hide
        Uwe Schindler added a comment -

        Commit to trunk only, not backport to branch_4x.

        Show
        Uwe Schindler added a comment - Commit to trunk only, not backport to branch_4x.
        Hide
        Robert Muir added a comment -

        Benson, he just means the patch would only be committed to trunk. I agree with this...

        Show
        Robert Muir added a comment - Benson, he just means the patch would only be committed to trunk. I agree with this...
        Hide
        Benson Margulies added a comment -

        How about we start by adding ctors that don't require a reader, and do treat them as 4.x fodder?

        Show
        Benson Margulies added a comment - How about we start by adding ctors that don't require a reader, and do treat them as 4.x fodder?
        Hide
        Benson Margulies added a comment -

        setReader throws IOException, but the existing constructors don't. Analyzer 'createComponents' doesn't. How to sort this out?

        Show
        Benson Margulies added a comment - setReader throws IOException, but the existing constructors don't. Analyzer 'createComponents' doesn't. How to sort this out?
        Hide
        Robert Muir added a comment -

        How about we start by adding ctors that don't require a reader, and do treat them as 4.x fodder?

        I'd prefer not, because then there needs to be very sophisticated backwards compat to know which one to call. and subclassing gets complicated.

        I would really prefer we just choose to fix the API, either 1) 5.0-only or 2) break it in a 4.x release

        From my perspective, Benson would be probably be the one most impacted by such a 4.x break. So if he really wants to do this, I have no problem.

        setReader throws IOException, but the existing constructors don't. Analyzer 'createComponents' doesn't. How to sort this out?

        I dont see the problem. I think createComponents doesnt need to throw exception: instead the logic of Analyzer.tokenStream changes slightly, to call components.setReader(r) in both cases of the if-then-else. Make sense?

        Show
        Robert Muir added a comment - How about we start by adding ctors that don't require a reader, and do treat them as 4.x fodder? I'd prefer not, because then there needs to be very sophisticated backwards compat to know which one to call. and subclassing gets complicated. I would really prefer we just choose to fix the API, either 1) 5.0-only or 2) break it in a 4.x release From my perspective, Benson would be probably be the one most impacted by such a 4.x break. So if he really wants to do this, I have no problem. setReader throws IOException, but the existing constructors don't. Analyzer 'createComponents' doesn't. How to sort this out? I dont see the problem. I think createComponents doesnt need to throw exception: instead the logic of Analyzer.tokenStream changes slightly, to call components.setReader(r) in both cases of the if-then-else. Make sense?
        Hide
        Benson Margulies added a comment -

        OK, I see. If we don't do compatibility, then no one calls setReader in createComponents, and all is well. OK, I'm proceeding.

        Show
        Benson Margulies added a comment - OK, I see. If we don't do compatibility, then no one calls setReader in createComponents, and all is well. OK, I'm proceeding.
        Hide
        Benson Margulies added a comment -

        Why does the reader get passed to createComponents in this model? Should that param go away?

        Show
        Benson Margulies added a comment - Why does the reader get passed to createComponents in this model? Should that param go away?
        Hide
        Benson Margulies added a comment -

        https://github.com/apache/lucene-solr/pull/16 is available for your read pleasure to see what these changes look like.

        Show
        Benson Margulies added a comment - https://github.com/apache/lucene-solr/pull/16 is available for your read pleasure to see what these changes look like.
        Hide
        Benson Margulies added a comment -

        Robert Muir Next frontier is TokenizerFactory.

        Do we change #create to not take a reader, or do we add 'throws IOException'? Based on comments above, I'd think we take out the reader.

        Michael McCandless I would love help. If you tell me your github id, I'll add you to my repo, and you can take up some of the ton of editing.

        Show
        Benson Margulies added a comment - Robert Muir Next frontier is TokenizerFactory. Do we change #create to not take a reader, or do we add 'throws IOException'? Based on comments above, I'd think we take out the reader. Michael McCandless I would love help. If you tell me your github id, I'll add you to my repo, and you can take up some of the ton of editing.
        Hide
        Robert Muir added a comment -

        Why does the reader get passed to createComponents in this model? Should that param go away?

        Yes: please nuke it!

        Show
        Robert Muir added a comment - Why does the reader get passed to createComponents in this model? Should that param go away? Yes: please nuke it!
        Hide
        Robert Muir added a comment -

        Do we change #create to not take a reader, or do we add 'throws IOException'? Based on comments above, I'd think we take out the reader.

        Yes, please. its not like a factory can do anything fancy with it anyway: its only called the first time, subsequent readers are invoked with setReader. so this is just more of the same, please nuke it!

        Show
        Robert Muir added a comment - Do we change #create to not take a reader, or do we add 'throws IOException'? Based on comments above, I'd think we take out the reader. Yes, please. its not like a factory can do anything fancy with it anyway: its only called the first time, subsequent readers are invoked with setReader. so this is just more of the same, please nuke it!
        Hide
        Uwe Schindler added a comment -

        Yes, createComponents should no longer get a Reader, too! Same for factories. The factory just creates the instance, setting the reading is up to the consumer.

        Show
        Uwe Schindler added a comment - Yes, createComponents should no longer get a Reader, too! Same for factories. The factory just creates the instance, setting the reading is up to the consumer.
        Hide
        Uwe Schindler added a comment -

        The cool thing is: In Analyzer we may simplify things like the initReader() protected method. We should also look at those APIs. Most of the code in Analyzer is to work around the ctor / setReader stuff.

        Show
        Uwe Schindler added a comment - The cool thing is: In Analyzer we may simplify things like the initReader() protected method. We should also look at those APIs. Most of the code in Analyzer is to work around the ctor / setReader stuff.
        Hide
        Benson Margulies added a comment -

        Robert Muir or Michael McCandless I could really use some help here with TestRandomChains.

        Show
        Benson Margulies added a comment - Robert Muir or Michael McCandless I could really use some help here with TestRandomChains.
        Hide
        Robert Muir added a comment -

        its a monster...

        Show
        Robert Muir added a comment - its a monster...
        Hide
        Benson Margulies added a comment -

        It does something complex with the input reader in a createComponents. the challenge is to move all that to initReader so that it works. I think I'm too fried from 1000 other edits, I'll look in after dinner but anyone who wants to grab my branch from github and pitch in is more than welcome.

        Show
        Benson Margulies added a comment - It does something complex with the input reader in a createComponents. the challenge is to move all that to initReader so that it works. I think I'm too fried from 1000 other edits, I'll look in after dinner but anyone who wants to grab my branch from github and pitch in is more than welcome.
        Hide
        Robert Muir added a comment -

        Yes: well the CheckThatYouDidntReadAnythingReaderWrapper can likely be removed. you are removing that possibility entirely. so it should get simpler... ill have a look at your branch tonight and try to help with some of this stuff, its hairy.

        Show
        Robert Muir added a comment - Yes: well the CheckThatYouDidntReadAnythingReaderWrapper can likely be removed. you are removing that possibility entirely. so it should get simpler... ill have a look at your branch tonight and try to help with some of this stuff, its hairy.
        Hide
        Benson Margulies added a comment -

        You got me off the dot on RandomChains.

        Show
        Benson Margulies added a comment - You got me off the dot on RandomChains.
        Hide
        Benson Margulies added a comment -

        I've got all of lucene to compile, and a bunch of tests running.

        Show
        Benson Margulies added a comment - I've got all of lucene to compile, and a bunch of tests running.
        Hide
        Benson Margulies added a comment - - edited

        I have Solr test failures in PreAnalyzedField, which has some stubborn fondness for the idea of a reader passed to a constructor. But that seems to be all that's broken; a few Solr failures (based on 'ant test').

        Show
        Benson Margulies added a comment - - edited I have Solr test failures in PreAnalyzedField, which has some stubborn fondness for the idea of a reader passed to a constructor. But that seems to be all that's broken; a few Solr failures (based on 'ant test').
        Hide
        Robert Muir added a comment -

        The changes look great overall. This particular one is tricky. Lemme take a look...

        Show
        Robert Muir added a comment - The changes look great overall. This particular one is tricky. Lemme take a look...
        Hide
        Robert Muir added a comment -
        diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java
        index 6ac073a..534c166 100644
        --- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java
        +++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java
        @@ -75,7 +75,7 @@ public abstract class AbstractAnalysisFactory {
             return originalArgs;
           }
         
        -   /** this method can be called in the {@link org.apache.lucene.analysis.util.TokenizerFactory#create(java.io.Reader)}
        +   /** this method can be called in the {@link org.apache.lucene.analysis.util.TokenizerFactory#create()}
            * or {@link org.apache.lucene.analysis.util.TokenFilterFactory#create(org.apache.lucene.analysis.TokenStream)} methods,
            * to inform user, that for this factory a {@link #luceneMatchVersion} is required */
           protected final void assureMatchVersion() {
        diff --git a/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java b/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java
        index fb741a5..3d3e8e2 100644
        --- a/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java
        +++ b/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java
        @@ -290,6 +290,7 @@ public class PreAnalyzedField extends FieldType {
           
             @Override
             public final void reset() throws IOException {
        +      super.reset();
               /* This is called after setReader, so here's how we can get the input we care about ... */
               this.input = super.input;
               // NOTE: this acts like rewind if you call it again
        
        Show
        Robert Muir added a comment - diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java index 6ac073a..534c166 100644 --- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java +++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java @@ -75,7 +75,7 @@ public abstract class AbstractAnalysisFactory { return originalArgs; } - /** this method can be called in the {@link org.apache.lucene.analysis.util.TokenizerFactory#create(java.io.Reader)} + /** this method can be called in the {@link org.apache.lucene.analysis.util.TokenizerFactory#create()} * or {@link org.apache.lucene.analysis.util.TokenFilterFactory#create(org.apache.lucene.analysis.TokenStream)} methods, * to inform user, that for this factory a {@link #luceneMatchVersion} is required */ protected final void assureMatchVersion() { diff --git a/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java b/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java index fb741a5..3d3e8e2 100644 --- a/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java +++ b/solr/core/src/java/org/apache/solr/schema/PreAnalyzedField.java @@ -290,6 +290,7 @@ public class PreAnalyzedField extends FieldType { @Override public final void reset() throws IOException { + super.reset(); /* This is called after setReader, so here's how we can get the input we care about ... */ this.input = super.input; // NOTE: this acts like rewind if you call it again
        Hide
        Robert Muir added a comment -

        Benson's changes, merged up to latest trunk as a patch for review.

        I further simplified the PreAnalyzed stuff (only possible due to this change), because it still had problems, and fixed the test for uppercasefilter.

        I think this is good to go.

        Show
        Robert Muir added a comment - Benson's changes, merged up to latest trunk as a patch for review. I further simplified the PreAnalyzed stuff (only possible due to this change), because it still had problems, and fixed the test for uppercasefilter. I think this is good to go.
        Hide
        Benson Margulies added a comment -

        Should I try to get the branch in git to match the .patch, or should I just let you proceed from here? I guess that might depend on reactions of others.

        Show
        Benson Margulies added a comment - Should I try to get the branch in git to match the .patch, or should I just let you proceed from here? I guess that might depend on reactions of others.
        Hide
        Robert Muir added a comment -

        Benson, dont worry about it, I think its good. I just put up the patch so that Uwe might look at it.

        Show
        Robert Muir added a comment - Benson, dont worry about it, I think its good. I just put up the patch so that Uwe might look at it.
        Hide
        Uwe Schindler added a comment -

        I am fine with this patch in trunk only. We can decide later if we backport.

        Show
        Uwe Schindler added a comment - I am fine with this patch in trunk only. We can decide later if we backport.
        Hide
        ASF subversion and git services added a comment -

        Commit 1556801 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1556801 ]

        LUCENE-5388: remove Reader from Tokenizer ctor (closes #16)

        Show
        ASF subversion and git services added a comment - Commit 1556801 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1556801 ] LUCENE-5388 : remove Reader from Tokenizer ctor (closes #16)
        Hide
        Uwe Schindler added a comment -

        Juhu!

        Show
        Uwe Schindler added a comment - Juhu!
        Hide
        Robert Muir added a comment -

        Marking fixed for 5.0.
        Thanks a lot Benson for doing all the grunt work here.

        Note: If you really want a backport please just open an issue and hash it out.

        Show
        Robert Muir added a comment - Marking fixed for 5.0. Thanks a lot Benson for doing all the grunt work here. Note: If you really want a backport please just open an issue and hash it out.
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Unassigned
            Reporter:
            Benson Margulies
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development