Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If you don't reuse your Doc/Field instances (which is very expert: I
      suspect few apps do) then there's a lot of garbage created to index each
      StringField because we make a new StringTokenStream or
      NumericTokenStream (and their Attributes).

      We should be able to re-use these instances via a static
      ThreadLocal...
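
      As an illustration only, here is a minimal sketch of the kind of reuse a static ThreadLocal would enable. The class names (ReusableStringTokenStream, StringTokenStreams) are hypothetical, not Lucene internals, and the discussion below explains why a static ThreadLocal was ultimately avoided in favor of letting the indexer hand a reusable stream back to the field:

      import java.io.IOException;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

      // Hypothetical single-token stream that is re-pointed at a new value
      // instead of being re-allocated for every StringField.
      final class ReusableStringTokenStream extends TokenStream {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private String value;
        private boolean used;

        void setValue(String value) {
          this.value = value;
        }

        @Override
        public void reset() throws IOException {
          used = false;
        }

        @Override
        public boolean incrementToken() throws IOException {
          if (used) {
            return false;               // only one token per field value
          }
          clearAttributes();
          termAtt.append(value);        // emit the whole value as a single term
          used = true;
          return true;
        }
      }

      final class StringTokenStreams {
        // One instance per indexing thread, so no new TokenStream (or its
        // Attributes) is allocated for each StringField.
        private static final ThreadLocal<ReusableStringTokenStream> CACHE =
            new ThreadLocal<ReusableStringTokenStream>() {
              @Override
              protected ReusableStringTokenStream initialValue() {
                return new ReusableStringTokenStream();
              }
            };

        static TokenStream forValue(String value) {
          ReusableStringTokenStream ts = CACHE.get();
          ts.setValue(value);
          return ts;
        }
      }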

      Attachments

    1. LUCENE-5634.patch
        14 kB
        Robert Muir
      2. LUCENE-5634.patch
        7 kB
        Robert Muir
      3. LUCENE-5634.patch
        3 kB
        Michael McCandless
      4. LUCENE-5634.patch
        2 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Initial patch, but I'd like to find a way to reuse NumericTokenStream
        too ... but it's trickier since the precStep is final (maybe we can
        un-final it and add a setter?)

        I ran a quick test, indexing all of Geonames (8.6 M docs), which is a
        good test for "tiny documents" ... with trunk it's 56 seconds and with
        the patch it's 45 seconds, ~20% faster.

        Shay Banon added a comment -

        This optimization has proven to help a lot in the context of ES, but there we can use a static thread local since we are fully in control of the threading model. With Lucene itself, which can be used in many different environments, this can cause some unexpected behavior. For example, it might cause Tomcat to warn about leaking resources when unloading a war.

        Michael McCandless added a comment -

        Here's an alternate approach, pulled from early iterations on LUCENE-5611, to specialize indexing of just a single string ... there are still nocommits, it needs to be made more generic to any non-tokenized field, etc. It's sort of silly to build up an entire TokenStream when really you just need to index the one token ...

        This patch indexes Geonames in ~38.5 sec, or ~31% faster than trunk.

        Uwe Schindler added a comment - edited

        but it's trickier since the precStep is final (maybe we can un-final it and add a setter?)

        Please don't do this. It may be better to do it like in Elasticsearch: have a pool of NTS for each precision step.

        This optimization has proven to help a lot in the context of ES, but there we can use a static thread local since we are fully in control of the threading model. With Lucene itself, which can be used in many different environments, this can cause some unexpected behavior. For example, it might cause Tomcat to warn about leaking resources when unloading a war.

        Thanks Shay: this is really the reason why we always refused to use static ThreadLocals in Lucene, especially for such heavily used components.

        Maybe we can do something similar to what Mike's patch does with StringField. It's a bit crazy to move the TokenStreams out of the field, but we can do this for performance here. Just have a lazily initialized pool of NumericTokenStreams for each precisionStep in each per-thread DocumentsWriter (DefaultIndexingChain).
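
        For illustration, a rough sketch of such a per-thread pool (the class name and its exact placement are assumptions, not the actual patch):

        import java.util.HashMap;
        import java.util.Map;
        import org.apache.lucene.analysis.NumericTokenStream;

        // Illustrative per-indexing-thread pool: one NumericTokenStream cached
        // per precisionStep, created lazily. No synchronization is needed
        // because the pool is owned by a single indexing thread.
        final class NumericTokenStreamPool {
          private final Map<Integer, NumericTokenStream> byPrecisionStep = new HashMap<>();

          NumericTokenStream get(int precisionStep, long value) {
            NumericTokenStream nts = byPrecisionStep.get(precisionStep);
            if (nts == null) {
              nts = new NumericTokenStream(precisionStep);  // precisionStep can stay final
              byPrecisionStep.put(precisionStep, nts);
            }
            return nts.setLongValue(value);                 // reuse for the next value/document
          }
        }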

        -1 to add thread locals in Lucene here!

        Another idea for how to manage the pools: Maybe add a protected method to Field that can get the DocumentsWriter instance and add some caching functionality for arbitrary TokenStreams (not just NumericTS or StringTS). Maybe add a method on the per-thread DocumentsWriter to set a TokenStream for reuse per field. The field (also custom ones) could then use setCachedTokenStream/getCachedTokenStream through the DocumentsWriter accessor from inside the Field.

        Uwe Schindler added a comment -

        Another idea: Maybe add a parameter to Field#tokenStream(), passing the previously cached instance! That way the field could obviously reuse the TokenStream if the type (instanceof check) is correct. If not, throw it away and create a new one. The indexer then manages the cache (it's just a field in DefaultIndexingChain or DocumentsWriter).
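
        To make that concrete, here is a hedged sketch of what such a reuse parameter could look like on a field; the two-argument signature and the ReusableStringTokenStream type (from the sketch under the description) are illustrative assumptions, not the final API:

        import java.io.IOException;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.FieldType;

        // Illustrative field whose tokenStream() receives the indexer-cached stream.
        class CachedStringField extends Field {

          CachedStringField(String name, String value, FieldType type) {
            super(name, value, type);
          }

          // Hypothetical two-argument signature carrying the previously cached stream.
          public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) throws IOException {
            if (!(reuse instanceof ReusableStringTokenStream)) {
              // Wrong type (or null on the first document): throw it away, create a new one.
              reuse = new ReusableStringTokenStream();
            }
            ((ReusableStringTokenStream) reuse).setValue(stringValue());
            return reuse;   // the indexer keeps this instance and passes it back next time
          }
        }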

        Robert Muir added a comment -

        Another idea: Maybe add a parameter to Field#tokenStream(), passing the previously cached instance! That way the field could obviously reuse the TokenStream if the type (instanceof check) is correct. If not, throw it away and create a new one. The indexer then manages the cache (it's just a field in DefaultIndexingChain or DocumentsWriter).

        I like this idea better. Let's try it and see how bad it looks.

        Michael McCandless added a comment -

        Maybe add a parameter to Field#tokenStream(), passing the previously cached instance!

        This sounds like a good idea!

        Robert Muir added a comment -

        Here is a patch. Tests seem happy, but I didn't benchmark or write an explicit test yet.

        Personally I think it's bogus: I don't like that these fields (StringField, NumericField) "backdoor" the analyzer, and to me that's the real bug. But I am OK with the change as a step, because it only makes the low-level interface API more bogus.

        Michael McCandless added a comment -

        +1, patch looks good.

        I ran IndexGeoNames again; it took 37.6 seconds, which is a big speedup over trunk (55.6 seconds). However, it's only doing StringField right now ... I'll re-test w/ NumericField too.

        Michael McCandless added a comment -

        OK, with NumericField the full Geonames index takes 129.7 sec on trunk and 101.0 sec with the last patch ... nice speedup.

        Michael McCandless added a comment -

        BTW, that test was with precStep=8. If I use precStep=4 (still the default, we really have to fix LUCENE-5609!) then indexing time for Geonames with the patch is 164.8 sec (63% slower!).

        Uwe Schindler added a comment -

        Patch looks fine. I was afraid of complexity, but that looks quite good. I am not sure about backwards compatibility issues, but implementing your own IndexableField instance is still very expert. With Java 8 we could handle that with default interface methods (LOOOOOOL).

        The current patch is fine for the two special cases, although it's a bit risky if we add new "settings" to NTS or change its API (we should have equals...). Maybe in LUCENE-5605 we can improve the check. If we pass FieldType directly to NTS and NRQ, we can handle the whole thing by comparing the field type and not rely on crazy internals like precStep.

        It would be great if we could in the future remove the ThreadLocal from Analyzer, too, by using the same trick. Unfortunately, with the current contract on TokenStream it's hard to compare, unless we have a well-defined TokenStream#equals(). Ideally TokenStream#equals() should compare the "settings" of the stream and its inputs (for Filters), but that is too advanced for the two simple cases.

        Another solution for this would be to have some "holder" around the TokenStream that's cached and provides hashCode/equals. That way a Field could better determine whether a cached stream is its own (e.g. by putting a reference to its field type into the holder).
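
        As a sketch of that holder idea (the class name and the identity-based ownership check are assumptions):

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.document.FieldType;

        // Illustrative holder pairing a cached TokenStream with the FieldType that
        // produced it, so a Field can check "is this my stream?" without relying on
        // internals like precisionStep.
        final class CachedTokenStream {
          final TokenStream stream;
          private final FieldType producedBy;

          CachedTokenStream(TokenStream stream, FieldType producedBy) {
            this.stream = stream;
            this.producedBy = producedBy;
          }

          // Ownership check by reference to the FieldType instance that created the stream.
          boolean ownedBy(FieldType type) {
            return producedBy == type;
          }

          @Override
          public boolean equals(Object other) {
            return other instanceof CachedTokenStream
                && ((CachedTokenStream) other).producedBy == producedBy;
          }

          @Override
          public int hashCode() {
            return System.identityHashCode(producedBy);
          }
        }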

        Uwe Schindler added a comment -

        BTW, that test was with precStep=8. If I use precStep=4 (still the default, we really have to fix LUCENE-5609!) then indexing time for Geonames with the patch is 164.8 sec (63% slower!).

        Huh? How come? That makes no sense to me. Are you sure you are doing the right thing? Or are you comparing the speedup by this patch in combination with the precision step change?

        Robert Muir added a comment -

        I would prefer to simply break the interface rather than do anything sophisticated here. It's a very expert, low-level one. The patch had very minimal impact on the codebase.

        I think it's good to defer the Analyzer stuff and not do it here; that has a lot of consumers like QueryParsers, MoreLikeThis, Suggesters, ... That's a more complex issue. I am unsure that adding things like equals is a good idea; it might make things very complex. For now, if you implement your own subclass, you can just ignore the parameter and get the same performance, and so on.

        I will upload a new patch with tests (including doing stupid things).

        Uwe Schindler added a comment -

        I would prefer to simply break the interface rather than do anything sophisticated here. It's a very expert, low-level one. The patch had very minimal impact on the codebase.

        +1. Nevertheless, as we change a public interface, it should be mentioned in "Backwards Breaks".

        Robert Muir added a comment -

        Updated patch with tests.

        Uwe Schindler added a comment -

        +1 I am fine with that patch!

        ASF subversion and git services added a comment -

        Commit 1591992 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1591992 ]

        LUCENE-5634: Reuse TokenStream instances for string and numeric Fields

        Michael McCandless added a comment -

        Or are you comparing the speedup by this patch in combination with the precision step change?

        Baseline was the patch w/ precStep=8 and comp was the patch w/ precStep=4. I just re-ran to be sure; this is IndexGeoNames.java in luceneutil if you want to try it ... it's easy to run, you just need to download/unzip the Geonames corpus first. Net/net, precStep=4 is very costly and doesn't seem to buy much query-time speedup in my tests on LUCENE-5609.

        ASF subversion and git services added a comment -

        Commit 1592005 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1592005 ]

        LUCENE-5634: Reuse TokenStream instances for string and numeric Fields


          People

          • Assignee:
            Unassigned
          • Reporter:
            Michael McCandless
          • Votes:
            1
          • Watchers:
            5

            Dates

            • Created:
              Updated:
              Resolved:
