Lucene - Core
  1. Lucene - Core
  2. LUCENE-3396

Make TokenStream Reuse Mandatory for Analyzers

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having to return reusable TokenStreams. This is a big chunk of work, but its time to bite the bullet.

      I plan to attack this in the following way:

      • Collapse the logic of ReusableAnalyzerBase into Analyzer
      • Add a ReuseStrategy abstraction to Analyzer which controls whether the TokenStreamComponents are reused globally (as they are today) or per-field.
      • Convert all Analyzers over to using TokenStreamComponents. I've already seen that some of the TokenStreams created in tests need some work to be reusable (even if they aren't reused).
      • Remove Analyzer.reusableTokenStream and convert everything over to using .tokenStream (which will now be returning reusable TokenStreams).
      1. LUCENE-3396-forgotten.patch
        9 kB
        Chris Male
      2. LUCENE-3396-rab.patch
        100 kB
        Chris Male
      3. LUCENE-3396-rab.patch
        102 kB
        Chris Male
      4. LUCENE-3396-rab.patch
        99 kB
        Chris Male
      5. LUCENE-3396-rab.patch
        104 kB
        Chris Male
      6. LUCENE-3396-rab.patch
        104 kB
        Chris Male
      7. LUCENE-3396-rab.patch
        103 kB
        Chris Male
      8. LUCENE-3396-rab.patch
        101 kB
        Chris Male
      9. LUCENE-3396-remaining-analyzers.patch
        43 kB
        Chris Male
      10. LUCENE-3396-remaining-merging.patch
        181 kB
        Chris Male

        Issue Links

          Activity

          Hide
          Chris Male added a comment -

          TokenStream reuse is now mandatory

          Show
          Chris Male added a comment - TokenStream reuse is now mandatory
          Hide
          Simon Willnauer added a comment -

          chris, this seems to be done no? can you close it?

          Show
          Simon Willnauer added a comment - chris, this seems to be done no? can you close it?
          Hide
          Chris Male added a comment -

          Committed revision 1175297.

          I don't want to mark this as resolved just yet. I want to spin off another sub-task to move all consumers over to reusableTokenStream (and then rename it back to tokenStream).

          Show
          Chris Male added a comment - Committed revision 1175297. I don't want to mark this as resolved just yet. I want to spin off another sub-task to move all consumers over to reusableTokenStream (and then rename it back to tokenStream).
          Hide
          Robert Muir added a comment -

          +1!

          Show
          Robert Muir added a comment - +1!
          Hide
          Chris Male added a comment -

          I haven't actually committed yet. I was hoping you'd have a chance to review before I did. I'll now commit.

          Show
          Chris Male added a comment - I haven't actually committed yet. I was hoping you'd have a chance to review before I did. I'll now commit.
          Hide
          Robert Muir added a comment -

          I think this one can be marked resolved?

          by the way, I reviewed the patches here: thanks for putting in all this effort!
          this fixes my biggest pet peeve about lucene....awesome.

          Show
          Robert Muir added a comment - I think this one can be marked resolved? by the way, I reviewed the patches here: thanks for putting in all this effort! this fixes my biggest pet peeve about lucene....awesome.
          Hide
          Chris Male added a comment -

          Plan to commit this tomorrow.

          Show
          Chris Male added a comment - Plan to commit this tomorrow.
          Hide
          Chris Male added a comment -

          Patch which finalizes the remaining Analyzer changes and merges ReusableAnalyzerBase and Analyzer together.

          Command for the patch:

          # Keep old Analyzer.java temporarily for comparison
          svn rename lucene/src/java/org/apache/lucene/analysis/Analyzer.java lucene/src/java/org/apache/lucene/analysis/Analyzer.java.old
          svn rename lucene/src/java/org/apache/lucene/analysis/ReusableAnalyzerBase.java lucene/src/java/org/apache/lucene/analysis/Analyzer.java
          

          I have updated the javadocs to reflect the merging and have removed the assertFinal() assertion since tokenStream() and reusableTokenStream() are always final now.

          All tests pass and javadocs build. I am looking to commit this soon (with a MIGRATE.txt entry).

          Show
          Chris Male added a comment - Patch which finalizes the remaining Analyzer changes and merges ReusableAnalyzerBase and Analyzer together. Command for the patch: # Keep old Analyzer.java temporarily for comparison svn rename lucene/src/java/org/apache/lucene/analysis/Analyzer.java lucene/src/java/org/apache/lucene/analysis/Analyzer.java.old svn rename lucene/src/java/org/apache/lucene/analysis/ReusableAnalyzerBase.java lucene/src/java/org/apache/lucene/analysis/Analyzer.java I have updated the javadocs to reflect the merging and have removed the assertFinal() assertion since tokenStream() and reusableTokenStream() are always final now. All tests pass and javadocs build. I am looking to commit this soon (with a MIGRATE.txt entry).
          Hide
          Chris Male added a comment -

          Patch which converts the last of the Analyzers over to using ReusableAnalyzerBase. At this stage, RAB is now the only extension of Analyzer.

          Patch includes a few unique changes:

          • Adds AnalyzerWrapper which is used by Analyzers which wrap other Analyzers. This is necessary to allow access the TokenStreamComponents of the wrapped Analyzers, without making TSC public.
          • IndexSchemaRuntimeFieldTest is made out of IndexSchemaTest due to testRuntimeFieldCreation doing some thread-unsafe changes to the Analyzers stored in IndexSchema. When run in IndexSchemaTest, depending on the execution order, the test fails. When run by itself, there is no problems. I think this is okay because the actual code being tested is documented as being thread-unsafe and the test also notes it does some tacky things.

          I'm not going to look to commit this just yet, as I want to collapse Analyzer and RAB into a single class.

          Show
          Chris Male added a comment - Patch which converts the last of the Analyzers over to using ReusableAnalyzerBase. At this stage, RAB is now the only extension of Analyzer. Patch includes a few unique changes: Adds AnalyzerWrapper which is used by Analyzers which wrap other Analyzers. This is necessary to allow access the TokenStreamComponents of the wrapped Analyzers, without making TSC public. IndexSchemaRuntimeFieldTest is made out of IndexSchemaTest due to testRuntimeFieldCreation doing some thread-unsafe changes to the Analyzers stored in IndexSchema. When run in IndexSchemaTest, depending on the execution order, the test fails. When run by itself, there is no problems. I think this is okay because the actual code being tested is documented as being thread-unsafe and the test also notes it does some tacky things. I'm not going to look to commit this just yet, as I want to collapse Analyzer and RAB into a single class.
          Hide
          Chris Male added a comment -

          Committed revision 1169654.

          Show
          Chris Male added a comment - Committed revision 1169654.
          Hide
          Chris Male added a comment -

          During the merging of changes I lost some conversions of some test Analyzers.

          Patch addresses them (including the unholy TestPayloads).

          Show
          Chris Male added a comment - During the merging of changes I lost some conversions of some test Analyzers. Patch addresses them (including the unholy TestPayloads).
          Hide
          Chris Male added a comment -

          Committed revision 1169607.

          Now attacking the remaining Analyzers.

          Show
          Chris Male added a comment - Committed revision 1169607. Now attacking the remaining Analyzers.
          Hide
          Chris Male added a comment -

          Patch updated to trunk.

          We're good to go now.

          Show
          Chris Male added a comment - Patch updated to trunk. We're good to go now.
          Hide
          Chris Male added a comment -

          Patch updated to reset now returns void. I'll make sure to note this compat break in the CHANGES.txt.

          Show
          Chris Male added a comment - Patch updated to reset now returns void. I'll make sure to note this compat break in the CHANGES.txt.
          Hide
          Chris Male added a comment -

          Hmmm, I agree that we should change it to void. If the source cannot be reset, it should throw an Exception. We need to be able to rely on the fact that we are using reusable components.

          I'll update the patch.

          Show
          Chris Male added a comment - Hmmm, I agree that we should change it to void. If the source cannot be reset, it should throw an Exception. We need to be able to rely on the fact that we are using reusable components. I'll update the patch.
          Hide
          Robert Muir added a comment -

          patch looks great: one question, should we keep the 'reset can return false and we do not reuse' ?

          this seems like it might be obselete, though it would introduce an API break, I think maybe we should change it to void?

          Show
          Robert Muir added a comment - patch looks great: one question, should we keep the 'reset can return false and we do not reuse' ? this seems like it might be obselete, though it would introduce an API break, I think maybe we should change it to void?
          Hide
          Chris Male added a comment -

          I'm going to commit this soon and then work on the remaining Analyzers.

          Show
          Chris Male added a comment - I'm going to commit this soon and then work on the remaining Analyzers.
          Hide
          Chris Male added a comment -

          Patch updated to trunk.

          Generic <T> is removed from ReuseStrategy.

          Show
          Chris Male added a comment - Patch updated to trunk. Generic <T> is removed from ReuseStrategy.
          Hide
          Uwe Schindler added a comment -

          I agree its somehow overkill. But if not on class level I would even remove the T parameter from ther getter method, because it does not really fit, it is only used there, not even on the setter. There is no type enforcement anywhere, so the extra T is just to remove the casting on the caller of the protected method, but adding a SuppressWarnings on the implementor's side. So either make all T or use Object everywhere.

          Show
          Uwe Schindler added a comment - I agree its somehow overkill. But if not on class level I would even remove the T parameter from ther getter method, because it does not really fit, it is only used there, not even on the setter. There is no type enforcement anywhere, so the extra T is just to remove the casting on the caller of the protected method, but adding a SuppressWarnings on the implementor's side. So either make all T or use Object everywhere.
          Hide
          Chris Male added a comment - - edited

          Hi Uwe,

          I originally had ReuseStrategy with a generic type but then decided it was overkill since it only benefits implementations, not users of ReuseStrategy. If we want the extra type safety, I'll happily make the change.

          Show
          Chris Male added a comment - - edited Hi Uwe, I originally had ReuseStrategy with a generic type but then decided it was overkill since it only benefits implementations, not users of ReuseStrategy. If we want the extra type safety, I'll happily make the change.
          Hide
          Uwe Schindler added a comment -

          Hi, the ReuseStrategies look fine. I am just confused about the Generics. Why not make the whole abstract ReuseStrategy T-typed? Then also ThreadLocal<T> is used and no casting anywhere. The subclasses for PerField and global then are typed to the correct class (Map<> or TSComponents).

          Show
          Uwe Schindler added a comment - Hi, the ReuseStrategies look fine. I am just confused about the Generics. Why not make the whole abstract ReuseStrategy T-typed? Then also ThreadLocal<T> is used and no casting anywhere. The subclasses for PerField and global then are typed to the correct class (Map<> or TSComponents).
          Hide
          Simon Willnauer added a comment -

          this patch looks good to me! I like the reuse strategy and how you factored out the thread local stuff.
          I think we should commit and let bake this in?

          Show
          Simon Willnauer added a comment - this patch looks good to me! I like the reuse strategy and how you factored out the thread local stuff. I think we should commit and let bake this in?
          Hide
          Chris Male added a comment -

          Absolutely, I missed that fieldname usage.

          Updated patch. Still all green.

          Show
          Chris Male added a comment - Absolutely, I missed that fieldname usage. Updated patch. Still all green.
          Hide
          Robert Muir added a comment -

          just a quick glance: the MockAnalyzer needs to reuse-per-field?

          This way we test the case where payloads are enabled for one field but not for another.

          Show
          Robert Muir added a comment - just a quick glance: the MockAnalyzer needs to reuse-per-field? This way we test the case where payloads are enabled for one field but not for another.
          Hide
          Chris Male added a comment -

          Grrr TestPayloads does some really non-reusable stuff. The only two options I can see are either using a NoReuseStrategy implementation in the test, or instantiate the Analyzer again every time it does something crazy.

          I've chosen the latter since I don't want to start the NoReuseStrategy slippery slope.

          We're in good shape now (minus this stupidity).

          Show
          Chris Male added a comment - Grrr TestPayloads does some really non-reusable stuff. The only two options I can see are either using a NoReuseStrategy implementation in the test, or instantiate the Analyzer again every time it does something crazy. I've chosen the latter since I don't want to start the NoReuseStrategy slippery slope. We're in good shape now (minus this stupidity).
          Hide
          Chris Male added a comment -

          Improved patch which does some cleanup and documentation ReuseAnalyzerBase.ReuseStrategy.

          I'm seeing one test fail in TestPayloads, this class does some insanely non-reusable stuff with payloads. I'll keep digging.

          Show
          Chris Male added a comment - Improved patch which does some cleanup and documentation ReuseAnalyzerBase.ReuseStrategy. I'm seeing one test fail in TestPayloads, this class does some insanely non-reusable stuff with payloads. I'll keep digging.
          Hide
          Chris Male added a comment -

          This is the uber patch which converts all but 8 Analyzers over to using ReusableAnalyzerBase.

          This also introduces the ReuseStrategy concept to ReusableAnalyzerBase (will need some documentation before committing). For simplicity, the ReuseStrategy maintains its own CloseableThreadLocal.

          The remaining Analyzers can be converted when I push the ReuseAnalyzerBase logic into Analyzer.

          After some small tidy (and a review) I think we should commit this even though we're going to collapse ReusabeAnalyzerBase.

          Show
          Chris Male added a comment - This is the uber patch which converts all but 8 Analyzers over to using ReusableAnalyzerBase. This also introduces the ReuseStrategy concept to ReusableAnalyzerBase (will need some documentation before committing). For simplicity, the ReuseStrategy maintains its own CloseableThreadLocal. The remaining Analyzers can be converted when I push the ReuseAnalyzerBase logic into Analyzer. After some small tidy (and a review) I think we should commit this even though we're going to collapse ReusabeAnalyzerBase.
          Hide
          Chris Male added a comment -
          Show
          Chris Male added a comment - Done. LUCENE-3397
          Hide
          Robert Muir added a comment -

          yeah: just thinking it could be a nice cleanup? Some of these are messy

          Show
          Robert Muir added a comment - yeah: just thinking it could be a nice cleanup? Some of these are messy
          Hide
          Chris Male added a comment -

          Do you mean correcting the TokenStreams in tests so they are reusable?

          Show
          Chris Male added a comment - Do you mean correcting the TokenStreams in tests so they are reusable?
          Hide
          Robert Muir added a comment -

          can we create a separate sub-tasks to fix the tests (so we can backport at least a majority of the changes)?

          Show
          Robert Muir added a comment - can we create a separate sub-tasks to fix the tests (so we can backport at least a majority of the changes)?

            People

            • Assignee:
              Chris Male
              Reporter:
              Chris Male
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development