Lucene - Core

LUCENE-6814: PatternTokenizer indefinitely holds heap equal to max field it has ever tokenized

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Caught by Alex Chow in this Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/13721

      Today, PatternTokenizer reuses a single StringBuilder, but it never frees that buffer's heap after tokenizing is done. We can either stop reusing, or ask it to trimToSize() when we are done ...
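      As a rough sketch of the second option (trimming when done), the idea is to clear and trim the reused buffer in close(). This assumes the reused buffer is the tokenizer's StringBuilder field (named str in the Lucene source); it is only an illustration, not the committed patch:

{code:java}
// Sketch only: release the reused buffer when the stream is closed.
// Assumes 'str' is PatternTokenizer's reused StringBuilder field.
@Override
public void close() throws IOException {
  try {
    super.close();      // Tokenizer.close() releases the input Reader
  } finally {
    str.setLength(0);   // forget the previous field's characters
    str.trimToSize();   // ask the StringBuilder to shrink its backing char[]
  }
}
{code}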

      Attachments

      1. LUCENE-6814.patch (3 kB) - Michael McCandless
      2. LUCENE-6814.patch (1 kB) - Michael McCandless

        Activity

        Uwe Schindler added a comment -

        It should call something like setLength(MIN_SIZE).trimToSize() on close().

        Michael McCandless added a comment -

        on close().

        Maybe on end() instead?

        Uwe Schindler added a comment -

        makes no sense. close() is correct - see documentation

        Michael McCandless added a comment -

        Patch, setting length and trimming in end() ....

        Michael McCandless added a comment -

        makes no sense. close() is correct - see documentation

        OK right, sorry

        close() is in fact called each time we analyze a String/Reader ... I always get confused by this

        Michael McCandless added a comment -

        New patch, moving the clearing to close and adding a fun test case.

        Alex Chow added a comment - edited

        Thanks for taking a look at this, Michael McCandless!

        It'd be great if this can get added to 4.10 so elasticsearch 1.x can pull it in too.

        (Never mind the extra stuff that was here. It all made sense when I tried looking through it)

        Michael McCandless added a comment -

        Commenting on the stuff you edited away because they are good confusions!

        I'm curious why there shouldn't be some trimming in `end()` as well? Or is a `TokenStream` meant to be used only once (no multiple `reset()`, `incrementToken()`, `end()` on the same `TokenStream`)?

        The TokenStream API is confusing

        I started with end() here too (it seemed correct), but it turns out close() is also called (internally, in Lucene's IndexWriter) after all tokens are iterated for a single input, and close() is called even on exception (but end() is not necessarily, I think).

        The TokenStream instance is typically thread-private, and re-used (for that one thread) for analysis on future docs.

        Elasticsearch seems to never reinstantiate Tokenizers and just reuses them for each field in an index, though I may be wrong. Or elasticsearch is using TokenStream the wrong way?

        ES is using Lucene's Analyzer (well, DelegatingAnalyzerWrapper I think), which (by default) reuses the Tokenizer instance per thread.

        It'd be great if this can get added to 4.10 so elasticsearch 1.x can pull it in too.

        I think it's unlikely Lucene will have another 4.10.x release, and ES is releasing 2.0.0 (using Lucene 5.3.x) shortly.

        Can you describe what impact you're seeing from this bug? How many PatternTokenizer instances is ES keeping in your case, how large are your docs, etc.? You could probably lower the ES bulk indexing thread pool size (if you don't in fact need so much concurrency) to reduce the impact of the bug ...

        I think this bug means PatternTokenizer holds onto the max sized doc it ever saw in heap right? Does StringBuilder ever reduce its allocated space by itself...

        Uwe Schindler added a comment - edited

        Tokenstream consumer workflow is clearly defined:
        https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/analysis/TokenStream.html

        The last step is close(). There is nothing confusing, just RTFM.

        close() is defined as "Releases resources associated with this stream.", and that's what we do here. end() has a different meaning. Its sole purpose is to get the "token" information after the very last token. At this time all resources still need to be held, because the attributes must be accessible after this call. After close() the resources are freed. In fact, this does not make a difference for this tokenizer, but in general stuff like the term attribute must still be in a valid state.

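        For reference, a minimal sketch of that consumer workflow (illustrative class and method names, Lucene 5.x API assumed), showing where end() and close() fall:

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative consumer of a TokenStream, following the documented workflow.
public class ConsumeTokenStream {
  public static void printTokens(Analyzer analyzer, String field, String text) throws IOException {
    try (TokenStream ts = analyzer.tokenStream(field, new StringReader(text))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                       // 1. reset the stream before consuming
      while (ts.incrementToken()) {     // 2. iterate over the tokens
        System.out.println(term.toString());
      }
      ts.end();                         // 3. end() exposes end-of-stream attributes; they must still be readable here
    }                                   // 4. close() always runs (even on exception) and releases resources,
                                        //    which is where PatternTokenizer now frees its buffer
  }
}
{code}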
        Alex Chow added a comment -

        Tokenstream consumer workflow is clearly defined:
        https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/analysis/TokenStream.html
        The last step is close(). There is nothing confusing, just RTFM.

        Yeah, just took a little bit to fully understand how things interact with reuse ("expert mode").

        I think it's unlikely Lucene will have another 4.10.x release, and ES is releasing 2.0.0 (using Lucene 5.3.x) shortly.

        That's pretty unfortunate. The tracker suggests that there are some patches slated for 4.10.5. Is 4.10 dead now?

        I'm not able to find what the plan for 1.x is when 2.0 hits GA. 1.7 was cut in July, so that release is probably going to be supported through early 2017. I'll probably have to push for the PatternTokenizer fork to get it into 1.x?

        Can you describe what impact you're seeing from this bug? How many PatternTokenizer instances is ES keeping in your case, how large are your docs, etc.? You could probably lower the ES bulk indexing thread pool size (if you don't in fact need so much concurrency) to reduce the impact of the bug...

        Our setup has 24 bulk indexing threads, and at peak we go through about 18 tasks/s (8k documents indexed per bulk request) per node with 21 nodes and 168 indices.

        We've been making an effort to reduce heap sizes (from 22gb to 8gb using 3x nodes), and PatternTokenizer ends up retaining around 2gb before we get into a GC death spiral (CMS with initiating occupancy=75%) but would otherwise grow a bit more. The biggest PatternTokenizer instances get to about 3-4mb. SegmentReaders occupy 3.5gb per node, so there's not too much space to work with unless we want to increase heap at least 1.5x. This pretty much destroys scaling horizontally and ends up being more... diagonal, with volume depending on the data?

        I'm pretty surprised nobody has really noticed this so far. Does everyone just like huge heaps (or just not use PatternAnalyzer)?

        I think this bug means PatternTokenizer holds onto the max sized doc it ever saw in heap right? Does StringBuilder ever reduce its allocated space by itself...

        I think this ends up growing to the max sized field. As far as I can tell trimToSize is the only way for StringBuilder to shrink its buffer, and even then the documentation suggests it isn't guaranteed (lots of "may")...
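        A small standalone check of that behavior (plain JDK, nothing Lucene-specific): setLength() alone leaves the capacity untouched, and only trimToSize() gives the builder a chance to shrink its backing array:

{code:java}
// Demonstrates that a StringBuilder keeps its largest-ever backing array
// until trimToSize() is called.
public class StringBuilderTrimDemo {
  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    sb.append(new char[4 * 1024 * 1024]);   // simulate tokenizing a 4M-char (~8 MB) field
    sb.setLength(0);                         // clears the contents, not the capacity
    System.out.println("capacity after setLength(0): " + sb.capacity());
    sb.trimToSize();                         // may shrink the backing char[]
    System.out.println("capacity after trimToSize(): " + sb.capacity());
  }
}
{code}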

        Michael McCandless added a comment -

        Is 4.10 dead now?

        It's not that it's dead, it's more that there would need to be a really really bad bug to warrant another 4.10.x release at this point. We are already well into 5.x, and just released 5.3.1 bug fix today: http://lucene.apache.org/#24-september-2015-apache-lucene-531-and-apache-solr-531-available

        I'll probably have to push for the PatternTokenizer fork to get it into 1.x?

        Yes, maybe ... or patch your ES locally with the change?

        and PatternTokenizer ends up retaining around 2gb

        Hmm, how many PatternTokenizer instances do you have? With 24 bulk indexing threads you should (I think?) only have at most 24 instances, and 24 instances * 4 MB max should be ~100 MB.

        Does everyone just like huge heaps (or just not use PatternAnalyzer)?

        I suspect PatternTokenizer is not commonly used ...

        There is a TODO in the code to fix this class to not hold an entire copy of the document ...

        I think this ends up growing to the max sized field.

        OK I'll put that in the CHANGES when I commit, and fix this issue title.

        Alex Chow added a comment -

        Hmm, how many PatternTokenizer instances do you have? With 24 bulk indexing threads you should (I think?) only have at most 24 instances * 4 MB max should be ~100 MB.

        PatternTokenizer instances also scale with the number of indices (technically shards?) if you use your own analyzer based on a PatternAnalyzer. Custom analyzers get instantiated per index since they have their own mappings. We have 168 indices, so that's a lot. One heap dump I was poking around had 3432 instances, but it'd probably have grown more since some analyzers haven't had to create their TokenStreamComponents yet.
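        As a rough illustration of that scaling (a hypothetical analyzer, not Elasticsearch code): each custom analyzer built around PatternTokenizer caches its TokenStreamComponents per thread, so every analyzer instance times every indexing thread can pin its own tokenizer buffer:

{code:java}
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;

// Hypothetical per-index analyzer. Lucene's default reuse strategy caches the
// components (including this PatternTokenizer) once per thread per Analyzer
// instance, so instance counts scale with indices * indexing threads.
public class CommaPatternAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new PatternTokenizer(Pattern.compile(","), -1);  // split on commas
    return new TokenStreamComponents(tokenizer);
  }
}
{code}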

        ASF subversion and git services added a comment -

        Commit 1712863 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1712863 ]

        LUCENE-6814: release heap in PatternTokenizer.close

        ASF subversion and git services added a comment -

        Commit 1712865 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1712865 ]

        LUCENE-6814: release heap in PatternTokenizer.close

        Michael McCandless added a comment -

        If we ever do another 4.10.x release we should backport this one ...


  People

  • Assignee: Michael McCandless
  • Reporter: Michael McCandless
  • Votes: 0
  • Watchers: 3