Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Changes like LUCENE-5634 make it clear that the default AttributeFactory stuff has a very high cost: weakmaps/reflection/etc.

      Additionally I think clearAttributes() is more expensive than it should be: it has to traverse a linked-list, calling clear() per token.

      Operations like cloning (save/restoreState) have a high cost tll.

      Maybe we can have a better Default? In other words, rename DEFAULT_ATTRIBUTE_FACTORY to REFLECTION_ATTRIBUTE_FACTORY, and instead have a faster default factory that just has one AttributeImpl with the "core ones" that 95% of users are dealing with (TOKEN_ATTRIBUTE_FACTORY?): anything outside of that falls back to reflection.

      1. LUCENE-5638.patch
        15 kB
        Robert Muir
      2. LUCENE-5638.patch
        14 kB
        Robert Muir
      3. LUCENE-5638.patch
        19 kB
        Robert Muir
      4. LUCENE-5638.patch
        18 kB
        Robert Muir
      5. LUCENE-5638.patch
        0.5 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Changes like LUCENE-5634 make it clear that the default AttributeFactory stuff has a very high cost: weakmaps/reflection/etc.

        The problem are not the weak maps and reflections. The reason why it is expensive is the fact that all attribute instances have to be put into the 2 LinkedHashMaps on creating the TokenStream. I just repeat: It is not the refection! We had this discussion already back 5 years ago with Michael Busch!

        In addition, the AttributeFactory itsself has less impact (this was already tested while developing it in 2.9). This is why the weak maps are there - so it is fast, the only reflection ever happens is: Class#newInstance() is cheap in recent Java versions, the speed difference in micro benchmarks is small, as fast as a native new.

        So I disagree with removing the default AttributeFactory, we still need it for non-default attributes, so: The simple workaround would be to use TOKEN_ATTRIBUTE_FACTORY instead, which falls back to the default one for unknown attributes.

        I agree with clearAttributes(), but this should be solved with TOKEN_ATTRIBUTE_FACTORY , too.

        Show
        Uwe Schindler added a comment - Changes like LUCENE-5634 make it clear that the default AttributeFactory stuff has a very high cost: weakmaps/reflection/etc. The problem are not the weak maps and reflections. The reason why it is expensive is the fact that all attribute instances have to be put into the 2 LinkedHashMaps on creating the TokenStream. I just repeat: It is not the refection! We had this discussion already back 5 years ago with Michael Busch! In addition, the AttributeFactory itsself has less impact (this was already tested while developing it in 2.9). This is why the weak maps are there - so it is fast, the only reflection ever happens is: Class#newInstance() is cheap in recent Java versions, the speed difference in micro benchmarks is small, as fast as a native new . So I disagree with removing the default AttributeFactory, we still need it for non-default attributes, so: The simple workaround would be to use TOKEN_ATTRIBUTE_FACTORY instead, which falls back to the default one for unknown attributes. I agree with clearAttributes(), but this should be solved with TOKEN_ATTRIBUTE_FACTORY , too.
        Hide
        Uwe Schindler added a comment -

        Easy and simple one-line patch.

        This uses the Token class as attributes impl, which supports:

        public class Token extends CharTermAttributeImpl 
                           implements TypeAttribute, PositionIncrementAttribute,
                                      FlagsAttribute, OffsetAttribute, PayloadAttribute, PositionLengthAttribute {
        

        Strangely, this test fails:

        [junit4] Tests with failures:
        [junit4]   - org.apache.lucene.analysis.TestGraphTokenizers.testMockGraphTokenFilterOnGraphInput
        [junit4]
        

        So this one seems to catch some bug in Token.java or the test does not work with this attribute impl (maybe it copies/clones in a wrong way).

        Show
        Uwe Schindler added a comment - Easy and simple one-line patch. This uses the Token class as attributes impl, which supports: public class Token extends CharTermAttributeImpl implements TypeAttribute, PositionIncrementAttribute, FlagsAttribute, OffsetAttribute, PayloadAttribute, PositionLengthAttribute { Strangely, this test fails: [junit4] Tests with failures: [junit4] - org.apache.lucene.analysis.TestGraphTokenizers.testMockGraphTokenFilterOnGraphInput [junit4] So this one seems to catch some bug in Token.java or the test does not work with this attribute impl (maybe it copies/clones in a wrong way).
        Hide
        Uwe Schindler added a comment -

        I found the bug: Token implemented PositionLengthAttribute but missed to implement all the clone/copyTo/equals/... shit. I willheavy commit that, because its a bug.

        Show
        Uwe Schindler added a comment - I found the bug: Token implemented PositionLengthAttribute but missed to implement all the clone/copyTo/equals/... shit. I willheavy commit that, because its a bug.
        Hide
        Uwe Schindler added a comment -

        I created a subtask: LUCENE-5639

        Show
        Uwe Schindler added a comment - I created a subtask: LUCENE-5639
        Hide
        Uwe Schindler added a comment -

        In analysis module, TestWikipediaTokenizer also fails, we have to dig. I don't understand the failure.

        Show
        Uwe Schindler added a comment - In analysis module, TestWikipediaTokenizer also fails, we have to dig. I don't understand the failure.
        Hide
        Uwe Schindler added a comment -

        I also created another subtask to clean up the Token class and remove stupid copy-ctors and all those reinit() methods. Unmaintainable! LUCENE-5640

        Show
        Uwe Schindler added a comment - I also created another subtask to clean up the Token class and remove stupid copy-ctors and all those reinit() methods. Unmaintainable! LUCENE-5640
        Hide
        Robert Muir added a comment -

        I benchmarked this patch against trunk with luceneutil's TestAnalyzerPerf.java.

        I ran 3 runs of it against both trunk and patch (because its quite noisy), but I think it shows a trend:
        especially for ones like Shingles doing lots of saveState()/restoreState() and ones like n-grams calling clearAttributes() many times.

        Times are reported in milliseconds to analyze en-wiki linedocs file: lower is better.

        Standard Run 1 Run 2 Run 3
        trunk 22983.37 23680.52 27390.50
        patch 21303.90 21419.21 21351.63
        LowerCase Run 1 Run 2 Run 3
        trunk 18032.37 17795.05 17533.72
        patch 16688.78 17267.51 16650.72
        EdgeNGrams Run 1 Run 2 Run 3
        trunk 40774.93 38842.67 53023.12
        patch 30798.36 31389.50 31799.47
        Shingles Run 1 Run 2 Run 3
        trunk 75702.71 75594.85 82081.14
        patch 49572.20 58220.35 48895.09
        WordDelimiter Run 1 Run 2 Run 3
        trunk 40101.05 38491.25 46736.42
        patch 30995.33 31743.91 30488.61

        About the patch: there are more Tokenizers in the analysis/ module referring to DEFAULT_ATTRIBUTE_FACTORY directly today (because they all have ctors to pass in a custom factory). So we should fix those too.

        Show
        Robert Muir added a comment - I benchmarked this patch against trunk with luceneutil's TestAnalyzerPerf.java. I ran 3 runs of it against both trunk and patch (because its quite noisy), but I think it shows a trend: especially for ones like Shingles doing lots of saveState()/restoreState() and ones like n-grams calling clearAttributes() many times. Times are reported in milliseconds to analyze en-wiki linedocs file: lower is better. Standard Run 1 Run 2 Run 3 trunk 22983.37 23680.52 27390.50 patch 21303.90 21419.21 21351.63 LowerCase Run 1 Run 2 Run 3 trunk 18032.37 17795.05 17533.72 patch 16688.78 17267.51 16650.72 EdgeNGrams Run 1 Run 2 Run 3 trunk 40774.93 38842.67 53023.12 patch 30798.36 31389.50 31799.47 Shingles Run 1 Run 2 Run 3 trunk 75702.71 75594.85 82081.14 patch 49572.20 58220.35 48895.09 WordDelimiter Run 1 Run 2 Run 3 trunk 40101.05 38491.25 46736.42 patch 30995.33 31743.91 30488.61 About the patch: there are more Tokenizers in the analysis/ module referring to DEFAULT_ATTRIBUTE_FACTORY directly today (because they all have ctors to pass in a custom factory). So we should fix those too.
        Hide
        Robert Muir added a comment -

        Updated patch: I added it to other tokenizers, i also got MockTokenizer using it by default and randomize between DEFAULT and TOKEN attribute factories in TestRandomChains.

        There were some test failures in TestPayloads, because they asserted PayloadAttribute was not present. I just removed the assertions.

        I will look for other failures and also inspect FreqProxTermsWriter to ensure it is optimized for the case where payloadAtt != null but returns null every time, and benchmark indexing somehow with the patch.

        Show
        Robert Muir added a comment - Updated patch: I added it to other tokenizers, i also got MockTokenizer using it by default and randomize between DEFAULT and TOKEN attribute factories in TestRandomChains. There were some test failures in TestPayloads, because they asserted PayloadAttribute was not present. I just removed the assertions. I will look for other failures and also inspect FreqProxTermsWriter to ensure it is optimized for the case where payloadAtt != null but returns null every time, and benchmark indexing somehow with the patch.
        Hide
        Robert Muir added a comment -

        Core tests are passing. Some analysis/common tests fail:

           [junit4] Tests with failures:
           [junit4]   - org.apache.lucene.analysis.wikipedia.WikipediaTokenizerTest.testRandomStrings
           [junit4]   - org.apache.lucene.analysis.wikipedia.WikipediaTokenizerTest.testRandomHugeStrings
           [junit4]   - org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter.testRandomStrings
           [junit4]   - org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter.testRandomHugeStrings
           [junit4]   - org.apache.lucene.analysis.path.TestPathHierarchyTokenizer.testRandomStrings
           [junit4]   - org.apache.lucene.analysis.path.TestPathHierarchyTokenizer.testRandomHugeStrings
           [junit4]   - org.apache.lucene.analysis.path.TestReversePathHierarchyTokenizer.testRandomStrings
           [junit4]   - org.apache.lucene.analysis.path.TestReversePathHierarchyTokenizer.testRandomHugeStrings
        

        Looking into these now...

        Show
        Robert Muir added a comment - Core tests are passing. Some analysis/common tests fail: [junit4] Tests with failures: [junit4] - org.apache.lucene.analysis.wikipedia.WikipediaTokenizerTest.testRandomStrings [junit4] - org.apache.lucene.analysis.wikipedia.WikipediaTokenizerTest.testRandomHugeStrings [junit4] - org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter.testRandomStrings [junit4] - org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter.testRandomHugeStrings [junit4] - org.apache.lucene.analysis.path.TestPathHierarchyTokenizer.testRandomStrings [junit4] - org.apache.lucene.analysis.path.TestPathHierarchyTokenizer.testRandomHugeStrings [junit4] - org.apache.lucene.analysis.path.TestReversePathHierarchyTokenizer.testRandomStrings [junit4] - org.apache.lucene.analysis.path.TestReversePathHierarchyTokenizer.testRandomHugeStrings Looking into these now...
        Hide
        Robert Muir added a comment -

        Now that tests are sorted out: here is an updated patch.

        Show
        Robert Muir added a comment - Now that tests are sorted out: here is an updated patch.
        Hide
        Robert Muir added a comment -

        Updated patch: all tests (and precommit) pass.

        I benchmarked indexing with various types of fields, the change actually gives a speedup even to StringFields (i suppose from faster clearAttributes?)

        This is ready to commit.

        Show
        Robert Muir added a comment - Updated patch: all tests (and precommit) pass. I benchmarked indexing with various types of fields, the change actually gives a speedup even to StringFields (i suppose from faster clearAttributes?) This is ready to commit.
        Hide
        Uwe Schindler added a comment -

        +1 We shoudl fix the Token subtask, too. Most of work is done, but maybe we split Token up.

        And: We should remove FlagsAttrbute!

        Show
        Uwe Schindler added a comment - +1 We shoudl fix the Token subtask, too. Most of work is done, but maybe we split Token up. And: We should remove FlagsAttrbute!
        Hide
        ASF subversion and git services added a comment -

        Commit 1592353 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1592353 ]

        LUCENE-5638: pack the core attributes into one impl by default

        Show
        ASF subversion and git services added a comment - Commit 1592353 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1592353 ] LUCENE-5638 : pack the core attributes into one impl by default
        Hide
        ASF subversion and git services added a comment -

        Commit 1592357 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1592357 ]

        LUCENE-5638: pack the core attributes into one impl by default

        Show
        ASF subversion and git services added a comment - Commit 1592357 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1592357 ] LUCENE-5638 : pack the core attributes into one impl by default
        Hide
        Uwe Schindler added a comment -

        More improvements to the DefaultAttributeFactory are developed in the subtask: LUCENE-5640 (using Java 7 MethodHandles)

        Show
        Uwe Schindler added a comment - More improvements to the DefaultAttributeFactory are developed in the subtask: LUCENE-5640 (using Java 7 MethodHandles)

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development