Lucene - Core
  1. Lucene - Core
  2. LUCENE-5638 Default Attributes are expensive
  3. LUCENE-5640

Cleanup & deprecate Token class / Improve default AttributeFactory to no longer use reflection

    Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We should remove code duplication in the Token class:

      • copy constructors
      • reinit() shit
      • non-default clone()

      This is too bugy. Most of the methods can be simply removed.

      This issue will also factor out the basic attributes to a separate implementation class (without the Token extra stuff). Token then just extends this class (in the tokenattributes package) and adds the additional stuff not needed for an Attribute. Token itsself will get deprecated.

      Also part of the slowdown in the parent issue is caused by ineffective DefaultAttributeFactory, which uses reflection to instantiate new attributes. As we are on Java 7 now, we can use MethodHandle to do this. MethodHandle does access checks only once on creating the factory or when the attribute is seen first. Later invocations are done without any result type conversions and parameter conversions as a statically linked constructor call. This greatly improves performance with Java 8, Java 7 is approx as fast, unless its completely static.

      1. LUCENE-5640.patch
        131 kB
        Uwe Schindler
      2. LUCENE-5640.patch
        114 kB
        Uwe Schindler
      3. LUCENE-5640.patch
        109 kB
        Uwe Schindler
      4. LUCENE-5640.patch
        106 kB
        Uwe Schindler
      5. LUCENE-5640.patch
        106 kB
        Uwe Schindler
      6. LUCENE-5640.patch
        102 kB
        Uwe Schindler
      7. LUCENE-5640.patch
        14 kB
        Robert Muir
      8. LUCENE-5640-4x.patch
        141 kB
        Uwe Schindler

        Activity

        Hide
        Robert Muir added a comment -

        The start to a patch. Don't worry about tests, i will clean those up on LUCENE-5642 first.

        Show
        Robert Muir added a comment - The start to a patch. Don't worry about tests, i will clean those up on LUCENE-5642 first.
        Hide
        Uwe Schindler added a comment - - edited

        I think we should factor out the attribute stuff "non-token" from Token class and move it as class inbetween: o.a.l.analysis.tokenattributes.CombinedTokenAttributeImpl

        Token is the just a subclass of this one and add the highlighter-needed crazy stuff. The basic impl is clean and straight-forward. Also we should remove FlagsAttribute from it and let only implement it in Token only (or nuke the whole attribute completely).

        Show
        Uwe Schindler added a comment - - edited I think we should factor out the attribute stuff "non-token" from Token class and move it as class inbetween: o.a.l.analysis.tokenattributes.CombinedTokenAttributeImpl Token is the just a subclass of this one and add the highlighter-needed crazy stuff. The basic impl is clean and straight-forward. Also we should remove FlagsAttribute from it and let only implement it in Token only (or nuke the whole attribute completely).
        Hide
        Robert Muir added a comment -

        That sounds great: though Token is pretty minimal here already, its just one ctor taking CharSequence,int,int that is used?

        Show
        Robert Muir added a comment - That sounds great: though Token is pretty minimal here already, its just one ctor taking CharSequence,int,int that is used?
        Hide
        Uwe Schindler added a comment -

        Yeah, it is more cosmetic. I just want the impl and AttributeFactory part of the tokenattributes package - and look like a attributeImpl: *Impl + only public no-arg ctor.

        Show
        Uwe Schindler added a comment - Yeah, it is more cosmetic. I just want the impl and AttributeFactory part of the tokenattributes package - and look like a attributeImpl: *Impl + only public no-arg ctor.
        Hide
        Uwe Schindler added a comment -

        Hi Robert: Do you think we need PayloadAttribute in the default set.

        I started to prepare a patch (based on your's) that refactors the basic impls away into a separate Impl class. I removed FlagsAttribute already, but PayloadAttribute is a different thing to look at. I don't think it should be in the default attributes!

        Show
        Uwe Schindler added a comment - Hi Robert: Do you think we need PayloadAttribute in the default set. I started to prepare a patch (based on your's) that refactors the basic impls away into a separate Impl class. I removed FlagsAttribute already, but PayloadAttribute is a different thing to look at. I don't think it should be in the default attributes!
        Hide
        Robert Muir added a comment -

        I really think it should, because the indexer must look for it for every token, for fields with positions enabled.

        Show
        Robert Muir added a comment - I really think it should, because the indexer must look for it for every token, for fields with positions enabled.
        Hide
        Robert Muir added a comment -

        OK I think i may be wrong, the indexer does this:

        payloadAttribute = attributeSource.getAttribute(PayloadAttribute.class);
        

        So this will not do anything right? its just a hashmap lookup?

        Show
        Robert Muir added a comment - OK I think i may be wrong, the indexer does this: payloadAttribute = attributeSource.getAttribute(PayloadAttribute.class); So this will not do anything right? its just a hashmap lookup?
        Hide
        Uwe Schindler added a comment -

        Yes!

        Show
        Uwe Schindler added a comment - Yes!
        Hide
        Uwe Schindler added a comment -

        Here is a (quite large) patch that refactors a bit more:

        • Move AttributeSource.AttributeFactory to a top-level class (this makes most of the patch)
        • Add a special AttributeFactory that can be used to implement factories that return a specific class instance for a set of attributes.
        • In AttrbuteFactory cache the constructor instances of all AttributeImpls. This spares looking up the constructor in addAttribute(). The trick with SoftReference was inspired by the JDK. Theoretically, the reference may not be needed at all if impl and attribute are from same classloader..
        • Factored out the basic attributes to PackedTokenAttributeImpl (no flags and payloads)
        • Token got deprecated (yippee) and additionally implements FlagsAttribute and PayloadAttribute
        Show
        Uwe Schindler added a comment - Here is a (quite large) patch that refactors a bit more: Move AttributeSource.AttributeFactory to a top-level class (this makes most of the patch) Add a special AttributeFactory that can be used to implement factories that return a specific class instance for a set of attributes. In AttrbuteFactory cache the constructor instances of all AttributeImpls. This spares looking up the constructor in addAttribute(). The trick with SoftReference was inspired by the JDK. Theoretically, the reference may not be needed at all if impl and attribute are from same classloader.. Factored out the basic attributes to PackedTokenAttributeImpl (no flags and payloads) Token got deprecated (yippee) and additionally implements FlagsAttribute and PayloadAttribute
        Hide
        Robert Muir added a comment -

        This patch has the same performance as trunk according to luceneutil's TestAnalyzerPerf.

        Show
        Robert Muir added a comment - This patch has the same performance as trunk according to luceneutil's TestAnalyzerPerf.
        Hide
        Uwe Schindler added a comment -

        New patch, which also improves the AttributeFactory: it no longer uses reflection to create the instances, instead Java 7 precompiled MethodHandles which are compiled to bytecode by the JVM.

        This one adds rethrow() again, because it is needed for MethodHandles to not loose Exceptions. The given implementation of rethrow() is private to AttributeSource, but works without instantiating (its same trick, but shorter and more elegant).

        Show
        Uwe Schindler added a comment - New patch, which also improves the AttributeFactory: it no longer uses reflection to create the instances, instead Java 7 precompiled MethodHandles which are compiled to bytecode by the JVM. This one adds rethrow() again, because it is needed for MethodHandles to not loose Exceptions. The given implementation of rethrow() is private to AttributeSource, but works without instantiating (its same trick, but shorter and more elegant).
        Hide
        Robert Muir added a comment -

        I benchmarked this against trunk, by hacking both checkouts to have a NoReuseStrategy... this patch is significantly (like 30-50%) slower than trunk.

        I think this java7 methodhandles stuff is very slow, it totally dominates the cpu...

        CPU SAMPLES BEGIN (total = 30511) Sun May  4 21:01:17 2014
        rank   self  accum   count trace method
           1 66.09% 66.09%   20164 300743 java.lang.invoke.MethodHandleNatives.init
           2  6.73% 72.81%    2052 300808 java.lang.Object.notifyAll
           3  6.44% 79.25%    1965 300823 java.lang.invoke.MethodHandleNatives.init
           4  0.95% 80.20%     289 300863 org.apache.lucene.codecs.compressing.LZ4.compress
           5  0.88% 81.08%     269 301196 org.apache.lucene.index.FreqProxFields$FreqProxDocsAndPositionsEnum.reset
        

        on the other hand with trunk, its of course slower than when we are reusing by far, but you dont see this stuff:

        CPU SAMPLES BEGIN (total = 23623) Sun May  4 21:05:15 2014
        rank   self  accum   count trace method
           1 13.48% 13.48%    3184 300667 java.lang.Object.notifyAll
           2 13.23% 26.71%    3125 300741 org.apache.lucene.analysis.core.LowerCaseFilter.<init>
           3  4.06% 30.77%     959 300768 java.util.LinkedHashMap.addEntry
           4  1.98% 32.75%     468 300767 org.apache.lucene.util.AttributeSource.addAttributeImpl
           5  1.60% 34.35%     379 300810 org.apache.lucene.util.BytesRefHash.add
           6  1.46% 35.81%     345 300763 org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill
           7  1.35% 37.17%     320 300812 org.apache.lucene.util.packed.AbstractAppendingLongBuffer.size
        

        "just say no to MethodHandle"... good to know

        Show
        Robert Muir added a comment - I benchmarked this against trunk, by hacking both checkouts to have a NoReuseStrategy... this patch is significantly (like 30-50%) slower than trunk. I think this java7 methodhandles stuff is very slow, it totally dominates the cpu... CPU SAMPLES BEGIN (total = 30511) Sun May 4 21:01:17 2014 rank self accum count trace method 1 66.09% 66.09% 20164 300743 java.lang.invoke.MethodHandleNatives.init 2 6.73% 72.81% 2052 300808 java.lang.Object.notifyAll 3 6.44% 79.25% 1965 300823 java.lang.invoke.MethodHandleNatives.init 4 0.95% 80.20% 289 300863 org.apache.lucene.codecs.compressing.LZ4.compress 5 0.88% 81.08% 269 301196 org.apache.lucene.index.FreqProxFields$FreqProxDocsAndPositionsEnum.reset on the other hand with trunk, its of course slower than when we are reusing by far, but you dont see this stuff: CPU SAMPLES BEGIN (total = 23623) Sun May 4 21:05:15 2014 rank self accum count trace method 1 13.48% 13.48% 3184 300667 java.lang.Object.notifyAll 2 13.23% 26.71% 3125 300741 org.apache.lucene.analysis.core.LowerCaseFilter.<init> 3 4.06% 30.77% 959 300768 java.util.LinkedHashMap.addEntry 4 1.98% 32.75% 468 300767 org.apache.lucene.util.AttributeSource.addAttributeImpl 5 1.60% 34.35% 379 300810 org.apache.lucene.util.BytesRefHash.add 6 1.46% 35.81% 345 300763 org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill 7 1.35% 37.17% 320 300812 org.apache.lucene.util.packed.AbstractAppendingLongBuffer.size "just say no to MethodHandle"... good to know
        Hide
        Uwe Schindler added a comment -

        New patch using invokeExact.

        Show
        Uwe Schindler added a comment - New patch using invokeExact.
        Hide
        Robert Muir added a comment -

        Now very close to trunk (only something like 2-3% difference, but doesnt look like noise).

        No strange stuff in profiler anymore. Also maybe it is someting else in the patch, i did not test the previous reflection-based one in this way yet, I am only comparing to trunk.

        Show
        Robert Muir added a comment - Now very close to trunk (only something like 2-3% difference, but doesnt look like noise). No strange stuff in profiler anymore. Also maybe it is someting else in the patch, i did not test the previous reflection-based one in this way yet, I am only comparing to trunk.
        Hide
        Uwe Schindler added a comment - - edited

        Hi Robert, I think the small difference is explained:

        In trunk we currently use Token class and the TokenAttributeFactory is creating it with new Token() after the isAssignableFrom() so there is no overhead at all. The latest patch is using a more generic factory with invokeExact. FYI, isAssignableFrom() is iintrinsic.

        If you compare the reflection-based patch (not trunk) with this one, you should see a large improvement; if you compare trunk with this, maybe a small degradion (which is what you see) because of extra invokeExact overhead. This code is now similar to Java 8's closures that use the "::" operator.

        In any case, as we reuse TokenStreams this was just a hard check of what happens if you don't reuse, so the latest patch is a huge improvement.

        Show
        Uwe Schindler added a comment - - edited Hi Robert, I think the small difference is explained: In trunk we currently use Token class and the TokenAttributeFactory is creating it with new Token() after the isAssignableFrom() so there is no overhead at all. The latest patch is using a more generic factory with invokeExact. FYI, isAssignableFrom() is iintrinsic. If you compare the reflection-based patch (not trunk) with this one, you should see a large improvement; if you compare trunk with this, maybe a small degradion (which is what you see) because of extra invokeExact overhead. This code is now similar to Java 8's closures that use the "::" operator. In any case, as we reuse TokenStreams this was just a hard check of what happens if you don't reuse, so the latest patch is a huge improvement.
        Hide
        Uwe Schindler added a comment -

        In addition, the code we now have is like LambdaMetaFactory in Java 8: The code used by the Token attribute factory or the PackedAttributeFactory is now:

        public static final AttributeFactory DEFAULT_TOKEN_ATTRIBUTE_FACTORY =
          AttributeFactory.getStaticImplementation(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, PackedTokenAttributeImpl.class);
        

        whose abstract newInstance() method does the same like a closure with PackedTokenAttributeImpl::new in Java 8. The trick here is to get the constructor non-reflective and call it with dynamic linking. So this is really the correct way to do this.

        This will improve also the speed with other addAttributes() where we don't have the combined factory, like in FuzzyQuery or whenever we add single attributes.

        The small overhead in the example (I don't see it in Java 8) that Robert measured could be removed by replacing the above code snippet with:

        public static final AttributeFactory DEFAULT_TOKEN_ATTRIBUTE_FACTORY =
          new AttributeFactory.StaticImplementationAttributeFactory<>(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, PackedTokenAttributeImpl.class) {
          @Override
          public PackedTokenAttributeImpl newInstance() { return new PackedTokenAttributeImpl(); }
        };
        

        But this makes code less readable, especially when we add new packed attribute impls (we could think of providing them for various other combinations). As said, the very small lambda overhead is neglectible, because we reuse.

        It would also be interesting to compare the good old non-packed AttributeImpl with that approach! I think we have an improvement here, too, but packing is still better (we spare the reflective call, but we still have to populate the maps).

        Show
        Uwe Schindler added a comment - In addition, the code we now have is like LambdaMetaFactory in Java 8: The code used by the Token attribute factory or the PackedAttributeFactory is now: public static final AttributeFactory DEFAULT_TOKEN_ATTRIBUTE_FACTORY = AttributeFactory.getStaticImplementation(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, PackedTokenAttributeImpl.class); whose abstract newInstance() method does the same like a closure with PackedTokenAttributeImpl::new in Java 8. The trick here is to get the constructor non-reflective and call it with dynamic linking. So this is really the correct way to do this. This will improve also the speed with other addAttributes() where we don't have the combined factory, like in FuzzyQuery or whenever we add single attributes. The small overhead in the example (I don't see it in Java 8) that Robert measured could be removed by replacing the above code snippet with: public static final AttributeFactory DEFAULT_TOKEN_ATTRIBUTE_FACTORY = new AttributeFactory.StaticImplementationAttributeFactory<>(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, PackedTokenAttributeImpl.class) { @Override public PackedTokenAttributeImpl newInstance() { return new PackedTokenAttributeImpl(); } }; But this makes code less readable, especially when we add new packed attribute impls (we could think of providing them for various other combinations). As said, the very small lambda overhead is neglectible, because we reuse. It would also be interesting to compare the good old non-packed AttributeImpl with that approach! I think we have an improvement here, too, but packing is still better (we spare the reflective call, but we still have to populate the maps).
        Hide
        Uwe Schindler added a comment -

        Cleanups:

        • fix javadoc error
        • improve documentation
        • improve MockUTF16TermAttributeImpl (including use of StandardCharsets)

        I think this is ready, if Robert agrees with my explanation I did before.

        While running tests, TestRandomChains also failed for me with PathTokenizer, but this also happened to jenkins, so its unrelated to the changes here.

        Show
        Uwe Schindler added a comment - Cleanups: fix javadoc error improve documentation improve MockUTF16TermAttributeImpl (including use of StandardCharsets) I think this is ready, if Robert agrees with my explanation I did before. While running tests, TestRandomChains also failed for me with PathTokenizer, but this also happened to jenkins, so its unrelated to the changes here.
        Hide
        Robert Muir added a comment -

        But this makes code less readable

        But prevents things from being slower! Can we please retain the "POJO" construction of Token? Doesn't it also impact things like save/restoreState and cloning?

        Show
        Robert Muir added a comment - But this makes code less readable But prevents things from being slower! Can we please retain the "POJO" construction of Token? Doesn't it also impact things like save/restoreState and cloning?
        Hide
        Uwe Schindler added a comment - - edited

        Doesn't it also impact things like save/restoreState and cloning?

        No. Save/restore state does not add/remove attributes. cloneAttributes() creates a deep clone of the internal maps, so it just copies what is already there. AttributeFactory is not used.

        This code only affects addAttribute - so constructing new TokenStreams, if the attribute is not already there. And this is only done on construction. This patch largely improves adding non-common attributes like KeywordAttribute or PayloadAttribute. The PackedAttributeImpl is the same as before, we can still talk about the explicit new, although I disagree here.

        This is like other discussions: Code duplication is not good just to win 1% of performance in a micro-benchmark. This is what you always telling Mike, too.

        Show
        Uwe Schindler added a comment - - edited Doesn't it also impact things like save/restoreState and cloning? No. Save/restore state does not add/remove attributes. cloneAttributes() creates a deep clone of the internal maps, so it just copies what is already there. AttributeFactory is not used. This code only affects addAttribute - so constructing new TokenStreams, if the attribute is not already there. And this is only done on construction. This patch largely improves adding non-common attributes like KeywordAttribute or PayloadAttribute. The PackedAttributeImpl is the same as before, we can still talk about the explicit new , although I disagree here. This is like other discussions: Code duplication is not good just to win 1% of performance in a micro-benchmark. This is what you always telling Mike, too.
        Hide
        Robert Muir added a comment -

        This is not really code duplication though, its the difference between doing something basic (like creating a new POJO in the old-fashioned way), and doing something complicated (like new untested MethodHandle/invokedynamic/etc/etc). One of these has been around for a while and we know how it behaves, the other has not.

        The attributes are thousands of lines of very complicated code: WeakHashMaps/SoftReferences/Reflection,etc. I'm only saying, if you can add those 3 lines so we don't get slower, that seems like a no brainer.

        Show
        Robert Muir added a comment - This is not really code duplication though, its the difference between doing something basic (like creating a new POJO in the old-fashioned way), and doing something complicated (like new untested MethodHandle/invokedynamic/etc/etc). One of these has been around for a while and we know how it behaves, the other has not. The attributes are thousands of lines of very complicated code: WeakHashMaps/SoftReferences/Reflection,etc. I'm only saying, if you can add those 3 lines so we don't get slower, that seems like a no brainer.
        Hide
        Michael McCandless added a comment -

        I haven't looked closely here, but why does simplifying Token.java required new hairy things like MethodHandle / getConstructorForInterface / etc.?

        Show
        Michael McCandless added a comment - I haven't looked closely here, but why does simplifying Token.java required new hairy things like MethodHandle / getConstructorForInterface / etc.?
        Hide
        Uwe Schindler added a comment -

        The MethodHandle code is also used to create the attributes in the DEFAULT_ATTRIBUTE_FACTORY. And this improves a lot in comparison to the old reflection based approach.

        I am sorry for the first patch yesterday to be slower. In fact it was unfair, was just to late in the evening (4 am in the morning):

        • I did not use invokeExact and because of that, unter the hood it used the similar code like reflection.
        • The unfair part was that the stupid bug I had was to implicit use reflection to create the Token instance, and this was bad.
        • With invokeExact its now identical to all code around for sorting and lists of Java 8. All code there uses the exact same pattern, generated by the javac compiler.
        • Invokedynamic is not involved at all. Invokedynamic is something else, here we have hard linked ctor calls.

        You standpoint is like forbidding this syntax in Java 8: Arrays.sort(array, Integer::compare); - this is heavily tested and as fast as Arrays.sort(array, new Comparator() { ... }; This is just some hotspot noise.

        It would be really nice (instead of complaining here about nonsense, sorry), if we could check the performance of this with non-packed AttributeImpls. E.g. for TokenStreams adding KeywordAttribute or PayloadAttribute or JapaneseFoobarAttribute. Here we should see a large improvement in contrast to old code. Or create a lot of FuzzyTermsEnums!

        Show
        Uwe Schindler added a comment - The MethodHandle code is also used to create the attributes in the DEFAULT_ATTRIBUTE_FACTORY. And this improves a lot in comparison to the old reflection based approach. I am sorry for the first patch yesterday to be slower. In fact it was unfair, was just to late in the evening (4 am in the morning): I did not use invokeExact and because of that, unter the hood it used the similar code like reflection. The unfair part was that the stupid bug I had was to implicit use reflection to create the Token instance, and this was bad. With invokeExact its now identical to all code around for sorting and lists of Java 8. All code there uses the exact same pattern, generated by the javac compiler. Invokedynamic is not involved at all. Invokedynamic is something else, here we have hard linked ctor calls. You standpoint is like forbidding this syntax in Java 8: Arrays.sort(array, Integer::compare); - this is heavily tested and as fast as Arrays.sort(array, new Comparator() { ... }; This is just some hotspot noise. It would be really nice (instead of complaining here about nonsense, sorry), if we could check the performance of this with non-packed AttributeImpls. E.g. for TokenStreams adding KeywordAttribute or PayloadAttribute or JapaneseFoobarAttribute. Here we should see a large improvement in contrast to old code. Or create a lot of FuzzyTermsEnums!
        Hide
        Robert Muir added a comment -

        OK, ill benchmark it (against both packed and unpacked). I ran several iterations when i benchmarked the patch last night to explicitly eliminate noise. I can do more so that we know if something is noise or if it is not

        Show
        Robert Muir added a comment - OK, ill benchmark it (against both packed and unpacked). I ran several iterations when i benchmarked the patch last night to explicitly eliminate noise. I can do more so that we know if something is noise or if it is not
        Hide
        Uwe Schindler added a comment -

        How about this:

        • I wait for your additional benchmark: I changed the code already to be more effective, maybe that makes it. It wozuld also be good to one time compare the combined case with Java 8 instead Java 7. Maybe there are improvements in Java 8 not yet in Java 7 regarding MethodHandle. In that case your argument is of course correct and it may be to early to change the default packed attribute imp
        • If the benchmark shows a neglectible slowdown (I don't accept 1% or 2% or like that - if you change the JDK version this 1% is easily reached and is in most cases just the happyness of Hotspot), I will change the default attribute factory for Token using PackedTokenAttributeImpl to have the explicit ctor call.
        • The others like the deprecated Token#TokenAttributeFactory and the mayn test-ones can use the dynamic code. In trunk I would like to let the non-explicit one in trunk, so I would commit the explicit new PackedTokenAttributeImpl() in 4.x. Is this a deal? Trunk will go to Java 8 at some time, so we are on the right path.
        • MethodHandles instead of Constrcutor instances should really be used for the DEFAULT_ATTRIBUTE_FACTORY because there should be a huge improvement (I did quick Mircobench: Class#newInstance() slower than Constructor#newInstance muuuuch slower than MethodHandle#invokeExact. This is the case, because all the access checks and type conversions are done one time and not on every instantiation.
        Show
        Uwe Schindler added a comment - How about this: I wait for your additional benchmark: I changed the code already to be more effective, maybe that makes it. It wozuld also be good to one time compare the combined case with Java 8 instead Java 7. Maybe there are improvements in Java 8 not yet in Java 7 regarding MethodHandle. In that case your argument is of course correct and it may be to early to change the default packed attribute imp If the benchmark shows a neglectible slowdown (I don't accept 1% or 2% or like that - if you change the JDK version this 1% is easily reached and is in most cases just the happyness of Hotspot), I will change the default attribute factory for Token using PackedTokenAttributeImpl to have the explicit ctor call. The others like the deprecated Token#TokenAttributeFactory and the mayn test-ones can use the dynamic code. In trunk I would like to let the non-explicit one in trunk, so I would commit the explicit new PackedTokenAttributeImpl() in 4.x. Is this a deal? Trunk will go to Java 8 at some time, so we are on the right path. MethodHandles instead of Constrcutor instances should really be used for the DEFAULT_ATTRIBUTE_FACTORY because there should be a huge improvement (I did quick Mircobench: Class#newInstance() slower than Constructor#newInstance muuuuch slower than MethodHandle#invokeExact. This is the case, because all the access checks and type conversions are done one time and not on every instantiation.
        Hide
        Robert Muir added a comment -

        I ran the benchmark 10 times for trunk and patch (indexing with reuse disabled, time in milliseconds, StandardAnalyzer)

        Trunk
        mean: 24386.1
        median: 24377.0
        stddev: 189.4

        Patch
        mean: 25047.6
        median: 25076.5
        stddev: 151.8

        I tried with JapaneseAnalyzer, thinking the patch would be faster in this case (has a lot of non-standard attributes). But its actually slower

        Trunk
        mean: 45059.3
        median: 45065
        stddev: 629.5

        Patch:
        mean: 51382.7
        median: 51496
        stddev: 196.3

        So something is up I think.

        Show
        Robert Muir added a comment - I ran the benchmark 10 times for trunk and patch (indexing with reuse disabled, time in milliseconds, StandardAnalyzer) Trunk mean: 24386.1 median: 24377.0 stddev: 189.4 Patch mean: 25047.6 median: 25076.5 stddev: 151.8 I tried with JapaneseAnalyzer, thinking the patch would be faster in this case (has a lot of non-standard attributes). But its actually slower Trunk mean: 45059.3 median: 45065 stddev: 629.5 Patch: mean: 51382.7 median: 51496 stddev: 196.3 So something is up I think.
        Hide
        Robert Muir added a comment -

        With java8, the performance is the exact opposite! so the patch is 10% faster with japanese where it was 10% slower with java7. And its slightly faster for standard where it was slghtly slower with java7.

        Show
        Robert Muir added a comment - With java8, the performance is the exact opposite! so the patch is 10% faster with japanese where it was 10% slower with java7. And its slightly faster for standard where it was slghtly slower with java7.
        Hide
        Robert Muir added a comment -

        and with update 55 (my previous benchmarks were update 25), it seems to be more or less the same (any difference appears to be noise). So i will not complain about that performance anymore.

        Show
        Robert Muir added a comment - and with update 55 (my previous benchmarks were update 25), it seems to be more or less the same (any difference appears to be noise). So i will not complain about that performance anymore.
        Hide
        Uwe Schindler added a comment - - edited

        I haven't looked closely here, but why does simplifying Token.java required new hairy things like MethodHandle / getConstructorForInterface / etc.?

        That's unrelated. This should improve the reflective default AttributeFactory (because it no longer uses refection, only initially to look up the instance). Whenever addAttribute() has to create a new attribute implementation it can create the instance without any overhead, almost as fast as "new SomethingImpl()". Before it was using reflection on every call with access checks, security manager,... Now its doing the reflective access only one time and then does a hard-linked invocation using the method handle. Its no hairy trick, its just Java 7 as documented. We were just not able to do this before. So this is a general improvement,

        getConstructorForInterface

        ... was already there before the change we just moved the impl to a new class. In the old code this was just using reflection to get the constructor. The method was renamed from {getClassForInterface}} to getConstructorForInterface to make it more typesafe - so no reflective access is needed.

        This is only partly related to Token, its just a general refactoring to have a packed attribute for the default attributes. This is just one big patch, sorry.

        Show
        Uwe Schindler added a comment - - edited I haven't looked closely here, but why does simplifying Token.java required new hairy things like MethodHandle / getConstructorForInterface / etc.? That's unrelated. This should improve the reflective default AttributeFactory (because it no longer uses refection, only initially to look up the instance). Whenever addAttribute() has to create a new attribute implementation it can create the instance without any overhead, almost as fast as "new SomethingImpl()". Before it was using reflection on every call with access checks, security manager,... Now its doing the reflective access only one time and then does a hard-linked invocation using the method handle. Its no hairy trick, its just Java 7 as documented. We were just not able to do this before. So this is a general improvement, getConstructorForInterface ... was already there before the change we just moved the impl to a new class. In the old code this was just using reflection to get the constructor. The method was renamed from {getClassForInterface}} to getConstructorForInterface to make it more typesafe - so no reflective access is needed. This is only partly related to Token, its just a general refactoring to have a packed attribute for the default attributes. This is just one big patch, sorry.
        Hide
        Uwe Schindler added a comment -

        New patch:

        • I improved the AttributeSource to no longer use a LinkedList for the implemented interfaces of an Impl
        • I am not really happy about this, but SoftReference was also wrong. After thinking more than 24 hrs about it, I have no good solution to prevent the static map from sitting on classes forever. The fix here is: We use MethodHandles only for attributes that were loaded by Lucene's own classloader. Really foreign ones (e.g. from Solr plugins) will solely use reflective mode to create new instances of attributes.

        Here the explanation:
        The problem is that we originally had a map Attribute -> AttributeImpl, which was weak for keys and values. By this the classloader can safely remove loaded classes, because we don't strongly reference the attributes nor the implementations. The values must be weak, too, otherwise the impl class indirectly strong references the interface (because its class object implements the interface). As Weak maps rely on the fact that the key get GCed, this will never happen, because of the strong value reference.

        This all went fine with the pure reflective approach. But when using method handles, we can no longer put them as Weak values into the map. Because nothing has a strong reference to the MethodHandle, its cleaned up asap. So we have the reflective access again and again. If we make the MethodHanlde a strong reference, we have the same problem like above: MethodHandle refers to its class (the AttributeImpl) and that one refers to the interface -> the key can never GCed.

        The workaround here is: The cache handles both cases (MethodHandles and reflective). The MethodHandles are strong references and are only used for Lucene's own classloader. So we cannot sit on foreign classes from plugins. Foreign interface implementations are handled like before.

        The fast path in createAttributeInstance() is now:

        • check cache
        • if cache contains MethodHandle invoke it directly (this is faster than before, because no wek ref dereferring)
        • if cache contains a Reference object (pointing to our the impl class), we use reflective: Class#newInstance() [same as in earlier Lucene versions]
        • if the cache does not have an entry, we load the class as always
        • if classloader is our own, we resolve the method handle and store it as hard reference in the weak map and execute it
        • otherwise we put a WeakReference to the class for later reflective access into the map.

        We have a test: newAttributeImpl in BaseTokenStreamTestcase chooses between all 4 AttributeFactories: DEFAULT (MethodHandles and Reflective), Reflective only, the packed impl, and Token's attribute factory.

        Show
        Uwe Schindler added a comment - New patch: I improved the AttributeSource to no longer use a LinkedList for the implemented interfaces of an Impl I am not really happy about this, but SoftReference was also wrong. After thinking more than 24 hrs about it, I have no good solution to prevent the static map from sitting on classes forever. The fix here is: We use MethodHandles only for attributes that were loaded by Lucene's own classloader. Really foreign ones (e.g. from Solr plugins) will solely use reflective mode to create new instances of attributes. Here the explanation: The problem is that we originally had a map Attribute -> AttributeImpl, which was weak for keys and values. By this the classloader can safely remove loaded classes, because we don't strongly reference the attributes nor the implementations. The values must be weak, too, otherwise the impl class indirectly strong references the interface (because its class object implements the interface). As Weak maps rely on the fact that the key get GCed, this will never happen, because of the strong value reference. This all went fine with the pure reflective approach. But when using method handles, we can no longer put them as Weak values into the map. Because nothing has a strong reference to the MethodHandle, its cleaned up asap. So we have the reflective access again and again. If we make the MethodHanlde a strong reference, we have the same problem like above: MethodHandle refers to its class (the AttributeImpl) and that one refers to the interface -> the key can never GCed. The workaround here is: The cache handles both cases (MethodHandles and reflective). The MethodHandles are strong references and are only used for Lucene's own classloader. So we cannot sit on foreign classes from plugins. Foreign interface implementations are handled like before. The fast path in createAttributeInstance() is now: check cache if cache contains MethodHandle invoke it directly (this is faster than before, because no wek ref dereferring) if cache contains a Reference object (pointing to our the impl class), we use reflective: Class#newInstance() [same as in earlier Lucene versions] if the cache does not have an entry, we load the class as always if classloader is our own, we resolve the method handle and store it as hard reference in the weak map and execute it otherwise we put a WeakReference to the class for later reflective access into the map. We have a test: newAttributeImpl in BaseTokenStreamTestcase chooses between all 4 AttributeFactories: DEFAULT (MethodHandles and Reflective), Reflective only, the packed impl, and Token's attribute factory.
        Hide
        Uwe Schindler added a comment -

        Added some tests.

        Will commit this in a moment.

        Show
        Uwe Schindler added a comment - Added some tests. Will commit this in a moment.
        Hide
        ASF subversion and git services added a comment -

        Commit 1592914 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1592914 ]

        LUCENE-5640: Refactor Token, add new PackedTokenAttributeImpl, make use of Java 7 MethodHandles in DEFAULT_ATTRIBUTE_FACTORY

        Show
        ASF subversion and git services added a comment - Commit 1592914 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1592914 ] LUCENE-5640 : Refactor Token, add new PackedTokenAttributeImpl, make use of Java 7 MethodHandles in DEFAULT_ATTRIBUTE_FACTORY
        Hide
        Uwe Schindler added a comment -

        Patch after merge to 4.x.

        Show
        Uwe Schindler added a comment - Patch after merge to 4.x.
        Hide
        ASF subversion and git services added a comment -

        Commit 1592929 from Uwe Schindler in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1592929 ]

        Merged revision(s) 1592914 from lucene/dev/trunk:
        LUCENE-5640: Refactor Token, add new PackedTokenAttributeImpl, make use of Java 7 MethodHandles in DEFAULT_ATTRIBUTE_FACTORY

        Show
        ASF subversion and git services added a comment - Commit 1592929 from Uwe Schindler in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1592929 ] Merged revision(s) 1592914 from lucene/dev/trunk: LUCENE-5640 : Refactor Token, add new PackedTokenAttributeImpl, make use of Java 7 MethodHandles in DEFAULT_ATTRIBUTE_FACTORY
        Hide
        ASF subversion and git services added a comment -

        Commit 1592930 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1592930 ]

        LUCENE-5640: Remove deprecated stuff

        Show
        ASF subversion and git services added a comment - Commit 1592930 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1592930 ] LUCENE-5640 : Remove deprecated stuff

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development