Lucene - Core
  1. Lucene - Core
  2. LUCENE-5803

Add another AnalyzerWrapper class that does not have its own cache, so delegate-only wrappers don't create thread local resources several times

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.9
    • Fix Version/s: 4.10, Trunk
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a followup issue for the following Elasticsearch issue: https://github.com/elasticsearch/elasticsearch/pull/6714

      Basically the problem is the following:

      • Elasticsearch has a pool of Analyzers that are used for analysis in several indexes
      • Each index uses a different PerFieldAnalyzerWrapper

      PerFieldAnalyzerWrapper uses PER_FIELD_REUSE_STRATEGY. Because of this it caches the tokenstreams for every field. If there are many fields, this are a lot. In addition, the underlying analyzers may also cache tokenstreams and other PerFieldAnalyzerWrappers do the same, although the delegate Analyzer can always return the same components.

      We should add similar code to Elasticsearch's directly to Lucene: If the delegating Analyzer just delegates per Field or just wraps CharFilters around the Reader, there is no need to cache the TokenStreamComponents a second time in the delegating Analyzers. This is only needed, if the delegating Analyzers adds additional TokenFilters (like ShingleAnalyzerWrapper).

      We should name this new class DelegatingAnalyzerWrapper extends AnalyzerWrapper. The wrapComponents method must be final, because we are not allowed to add additional TokenFilters, but unlike ES, we don't need to disallow wrapping with CharFilters.

      Internally this class uses a private ReuseStrategy that just delegates to the underlying analyzer. It does not matter here if the strategy of the delegate is global or per field, this is private to the delegate.

      1. LUCENE-5803.patch
        9 kB
        Uwe Schindler
      2. LUCENE-5803.patch
        9 kB
        Uwe Schindler
      3. LUCENE-5803.patch
        11 kB
        Uwe Schindler
      4. LUCENE-5803.patch
        14 kB
        Uwe Schindler
      5. LUCENE-5803.patch
        15 kB
        Uwe Schindler
      6. LUCENE-5803.patch
        16 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Thank you also to Shay Banon, who provided the original idea and patch inside the ES code tree.

        Show
        Uwe Schindler added a comment - Thank you also to Shay Banon, who provided the original idea and patch inside the ES code tree.
        Hide
        ASF subversion and git services added a comment -

        Commit 1608006 from Uwe Schindler in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1608006 ]

        Merged revision(s) 1608005 from lucene/dev/trunk:
        LUCENE-5803: Add more Javadocs to AnalyzerWrapper to encourage people to use DelegatingAnalyzerWrapper (if possible). Add changes entry for Solr.

        Show
        ASF subversion and git services added a comment - Commit 1608006 from Uwe Schindler in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1608006 ] Merged revision(s) 1608005 from lucene/dev/trunk: LUCENE-5803 : Add more Javadocs to AnalyzerWrapper to encourage people to use DelegatingAnalyzerWrapper (if possible). Add changes entry for Solr.
        Hide
        ASF subversion and git services added a comment -

        Commit 1608005 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1608005 ]

        LUCENE-5803: Add more Javadocs to AnalyzerWrapper to encourage people to use DelegatingAnalyzerWrapper (if possible). Add changes entry for Solr.

        Show
        ASF subversion and git services added a comment - Commit 1608005 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1608005 ] LUCENE-5803 : Add more Javadocs to AnalyzerWrapper to encourage people to use DelegatingAnalyzerWrapper (if possible). Add changes entry for Solr.
        Hide
        ASF subversion and git services added a comment -

        Commit 1608003 from Uwe Schindler in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1608003 ]

        Merged revision(s) 1607998 from lucene/dev/trunk:
        LUCENE-5803: Add DelegatingAnalyzerWrapper, an optimized variant of AnalyzerWrapper that doesn't allow to wrap components or readers

        Show
        ASF subversion and git services added a comment - Commit 1608003 from Uwe Schindler in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1608003 ] Merged revision(s) 1607998 from lucene/dev/trunk: LUCENE-5803 : Add DelegatingAnalyzerWrapper, an optimized variant of AnalyzerWrapper that doesn't allow to wrap components or readers
        Hide
        ASF subversion and git services added a comment -

        Commit 1607998 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1607998 ]

        LUCENE-5803: Add DelegatingAnalyzerWrapper, an optimized variant of AnalyzerWrapper that doesn't allow to wrap components or readers

        Show
        ASF subversion and git services added a comment - Commit 1607998 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1607998 ] LUCENE-5803 : Add DelegatingAnalyzerWrapper, an optimized variant of AnalyzerWrapper that doesn't allow to wrap components or readers
        Hide
        Uwe Schindler added a comment -

        New patch:

        • renamed a field and improved docs
        • replaced a horrible Solr offender: PreAnalyzedField was creating the analyzer over and over instead of creating it one time after field type init().

        I think this is ready.

        Show
        Uwe Schindler added a comment - New patch: renamed a field and improved docs replaced a horrible Solr offender: PreAnalyzedField was creating the analyzer over and over instead of creating it one time after field type init(). I think this is ready.
        Hide
        Uwe Schindler added a comment -

        Samuel García (‏https://twitter.com/samuelgmartinez) said on Twitter:

        @kimchy @thetaph1 we have a solr server cluster with 150 cores and about 20 indexed fields. We lost 1.5gb due to these zz_buffer tlocal.

        This patch will improve this situation, but not as good as in Elasticsearch. The difference in Solr is: Solr has a complete separation of cores (even with different classloader). Each core has its own schema with own field types. Every field type has its own analyzer. Those are combined in a PerFieldAnalyzer-like wrapper. If Solr would allow to define "field types" globally (across cores), this could be shared. But with crrent Solr, each core gets its own zz_buffer tlocal. The improvement in Solr due to this patch is:
        In the past we had a separate threadlocal per field name, because the AnalyzerWraper had a per-field-reuse strategy. With this patch we now have a global reuse strategy per FieldType. So the imporvement is: If you define a field type one time and reuse it for 20 fields, you have only one cached TokenStream, not 20. This is because we now delegate to the underlying Analyzer (the one from the field type), which has GLOBAL_REUSE_STRATEGY.

        Show
        Uwe Schindler added a comment - Samuel García (‏ https://twitter.com/samuelgmartinez ) said on Twitter: @kimchy @thetaph1 we have a solr server cluster with 150 cores and about 20 indexed fields. We lost 1.5gb due to these zz_buffer tlocal. This patch will improve this situation, but not as good as in Elasticsearch. The difference in Solr is: Solr has a complete separation of cores (even with different classloader). Each core has its own schema with own field types. Every field type has its own analyzer. Those are combined in a PerFieldAnalyzer-like wrapper. If Solr would allow to define "field types" globally (across cores), this could be shared. But with crrent Solr, each core gets its own zz_buffer tlocal. The improvement in Solr due to this patch is: In the past we had a separate threadlocal per field name , because the AnalyzerWraper had a per-field-reuse strategy. With this patch we now have a global reuse strategy per FieldType. So the imporvement is: If you define a field type one time and reuse it for 20 fields, you have only one cached TokenStream, not 20. This is because we now delegate to the underlying Analyzer (the one from the field type), which has GLOBAL_REUSE_STRATEGY.
        Hide
        Uwe Schindler added a comment -

        More improved test, does more checks by using a special non-delegating wrapper.

        Show
        Uwe Schindler added a comment - More improved test, does more checks by using a special non-delegating wrapper.
        Hide
        Uwe Schindler added a comment -

        I added a test to the analysis module that checks that wrapping (also mixed with default non-delegating AnalyzerWrapper) works as expected.

        Show
        Uwe Schindler added a comment - I added a test to the analysis module that checks that wrapping (also mixed with default non-delegating AnalyzerWrapper) works as expected.
        Hide
        Uwe Schindler added a comment -

        This is my new patch, that does not break the common pattern: new ShingleAnalyzerWrapper(new PerFieldAnalyzerWrapper(...)). The previous patch would have thrown IllegalStateException.

        In this patch, also the DelegatingAnalyzerWrapper gets a reuse strategy in its constructor, but this one is only used if you wrap a second time as fallback (this is why its called "defaultStrategy".

        Show
        Uwe Schindler added a comment - This is my new patch, that does not break the common pattern: new ShingleAnalyzerWrapper(new PerFieldAnalyzerWrapper(...)) . The previous patch would have thrown IllegalStateException. In this patch, also the DelegatingAnalyzerWrapper gets a reuse strategy in its constructor, but this one is only used if you wrap a second time as fallback (this is why its called "defaultStrategy".
        Hide
        Uwe Schindler added a comment -

        And for that reason, you shouldnt be able to wrap it with a charfilter. Use the existing subclass for "tweaking" the analyzer. Let this one be for pure delegation...

        I thought about that already and I agree. Although it is not really needed to explicitely forbid wrapping readers, we should still make the method final. Maybe in the future we will somehow also reuse the readers, so having a clear API contract here is a good idea.

        Attached is a new patch, but I have a better idea about the IllegalStateException.

        Show
        Uwe Schindler added a comment - And for that reason, you shouldnt be able to wrap it with a charfilter. Use the existing subclass for "tweaking" the analyzer. Let this one be for pure delegation... I thought about that already and I agree. Although it is not really needed to explicitely forbid wrapping readers, we should still make the method final. Maybe in the future we will somehow also reuse the readers, so having a clear API contract here is a good idea. Attached is a new patch, but I have a better idea about the IllegalStateException.
        Hide
        Robert Muir added a comment -

        And for that reason, you shouldnt be able to wrap it with a charfilter. Use the existing subclass for "tweaking" the analyzer. Let this one be for pure delegation...

        Show
        Robert Muir added a comment - And for that reason, you shouldnt be able to wrap it with a charfilter. Use the existing subclass for "tweaking" the analyzer. Let this one be for pure delegation...
        Hide
        Robert Muir added a comment -

        I think the name is fine myself. Its for delegation

        Show
        Robert Muir added a comment - I think the name is fine myself. Its for delegation
        Hide
        Uwe Schindler added a comment -

        I am not sure about the ideal name for this wrapper. Suggestions?

        Show
        Uwe Schindler added a comment - I am not sure about the ideal name for this wrapper. Suggestions?
        Hide
        Uwe Schindler added a comment -

        Patch.

        I added more Javadocs and tried to work around the stupid problem with the super constructor call cannot reference to this. There is the possibibility to do this by using the passed-in Analyzer, but then we loose the check throwing the IllegalStateException.

        We need this check, otherwise you would be able to corrumpt your analyzers: If you wrap this analyzer again with some other analyzer that uses the delegate reuse strategy, e.g., new ShingleAnalysisWrapper(new PerFieldAnalyzerWrapper(....)), the ShingleAnalysisWrapper would reuse the PerFieldAnalyzerWrapper's strategy (which is private to the PerFieldAnalysis wrapper) and by that inject illegal TokenStreamComponents into the inner's cache. So we must disallow this.

        This patch misses some tests for this special case and also to test if everything works fine.

        Solr is also using this Analyzer, so we see the improvements in Solr, too (not only in Elasticsearch). In fact, PER_FIELD_REUSE_STRATEGY is no longer used for pure per-field delegates. We no longer have one TokenStream instance per field, we have one instance per delegate Analyzer.

        Show
        Uwe Schindler added a comment - Patch. I added more Javadocs and tried to work around the stupid problem with the super constructor call cannot reference to this . There is the possibibility to do this by using the passed-in Analyzer, but then we loose the check throwing the IllegalStateException. We need this check, otherwise you would be able to corrumpt your analyzers: If you wrap this analyzer again with some other analyzer that uses the delegate reuse strategy, e.g., new ShingleAnalysisWrapper(new PerFieldAnalyzerWrapper(....)) , the ShingleAnalysisWrapper would reuse the PerFieldAnalyzerWrapper's strategy (which is private to the PerFieldAnalysis wrapper) and by that inject illegal TokenStreamComponents into the inner's cache. So we must disallow this. This patch misses some tests for this special case and also to test if everything works fine. Solr is also using this Analyzer, so we see the improvements in Solr, too (not only in Elasticsearch). In fact, PER_FIELD_REUSE_STRATEGY is no longer used for pure per-field delegates. We no longer have one TokenStream instance per field, we have one instance per delegate Analyzer.
        Hide
        Uwe Schindler added a comment -

        In fact, this makes PER_FIELD_REUSE_:STRATEGY for the PerFieldAnalyzerWrapper case obsolete, because in PerFieldAnalyzerWrapper we just leave the components caching up to the inner Analyzer, who can use GLOBAL or whatever else. This has the good effect, that we dont cache a TokenStream for every field, just for every delegate Analyzer.

        Show
        Uwe Schindler added a comment - In fact, this makes PER_FIELD_REUSE_:STRATEGY for the PerFieldAnalyzerWrapper case obsolete, because in PerFieldAnalyzerWrapper we just leave the components caching up to the inner Analyzer, who can use GLOBAL or whatever else. This has the good effect, that we dont cache a TokenStream for every field, just for every delegate Analyzer.

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development