SOLR-2927

SolrIndexSearcher's register does not match close and SolrCore's closeSearcher

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.10.3, 5.0, 6.0
    • Component/s: search
    • Labels:
      None
    • Environment:

      JDK1.6/CentOS

      Description

      1. SolrIndexSearcher's register method puts the searcher's name into the infoRegistry, but SolrCore's closeSearcher method removes the name of currentSearcher from the infoRegistry.
      2. SolrIndexSearcher's register method puts the cache names into the infoRegistry, but SolrIndexSearcher's close does not remove them.

      So some entries may be left registered, causing a memory leak.

      Attachments

      1. mbean-leak-jira.png (336 kB, Cyrille Roy)
      2. SOLR-2927.patch (2 kB, Shalin Shekhar Mangar)

        Activity

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Shalin Shekhar Mangar added a comment -
        1. SolrIndexSearcher's close method removes the name of the searcher so this is not a problem.
        2. SolrCore's close clears the infoRegistry so I don't think we hold on to any stats object

        Are you seeing memory leaks in your Solr process?

        Sharath Babu added a comment -

        SolrCore instantiates its SolrIndexSearcher asynchronously. If SolrCore hits an exception and closes before that SolrIndexSearcher has been opened, the searcher's mbeans are registered after SolrCore has already cleared its registry, so those mbeans are leaked.

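        As an illustration of that window, here is a minimal, self-contained Java sketch (hypothetical names, not Solr's actual code) in which an asynchronously opened "searcher" registers itself only after the owning core has already cleared its registry, so the entry is never removed:

        import java.util.Map;
        import java.util.concurrent.*;

        public class RegistryRaceSketch {
          public static void main(String[] args) throws Exception {
            Map<String, Object> infoRegistry = new ConcurrentHashMap<>();
            ExecutorService searcherExecutor = Executors.newSingleThreadExecutor();
            CountDownLatch coreClosed = new CountDownLatch(1);

            // Async "searcher open" that registers itself once it gets to run.
            searcherExecutor.submit(() -> {
              coreClosed.await();                         // runs only after the core has "closed"
              infoRegistry.put("searcher", new Object()); // lands in an already-cleared registry
              return null;
            });

            // The core hits an exception and closes: it clears the registry first...
            infoRegistry.clear();
            coreClosed.countDown();

            // ...and only afterwards shuts the executor down, so the late registration is stranded.
            searcherExecutor.shutdown();
            searcherExecutor.awaitTermination(5, TimeUnit.SECONDS);
            System.out.println("leaked entries: " + infoRegistry.keySet()); // prints [searcher]
          }
        }
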
        Uwe Schindler added a comment -

        Move issue to Solr 4.9.

        Cyrille Roy added a comment - - edited

        I have been able to reproduce this issue by patching the code to throw an exception in the SolrCore constructor (branch 4_2, line 875):
        ...
        resourceLoader.inform(resourceLoader);
        // DO NOT COMMIT THIS:
        if (!"metadata".equals(name)) throw new RuntimeException("test exception");
        ...

        You can then curl any core:
        $ curl "http://localhost:xxx/solr/CORE_NAME/select?q=*:*"

        Open jconsole and you will see the leaking mbean named "solr/CORE_NAME".

        Cyrille Roy added a comment - - edited

        Proposed patch: in SolrCore.close(), start by waiting for the searcherExecutor and then empty the infoRegistry, which will unregister the mbeans.
        The patch is built against trunk.
        Please let me know if that is not the right version to build a patch against.

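        As a counterpart to the race sketch earlier in this thread, a minimal sketch (again with hypothetical names, not SolrCore's actual close()) of the ordering proposed above: shut down and drain the searcher executor first, then clear the registry, so even a late registration gets removed:

        import java.util.Map;
        import java.util.concurrent.*;

        public class RegistryCloseOrderSketch {
          public static void main(String[] args) throws Exception {
            Map<String, Object> infoRegistry = new ConcurrentHashMap<>();
            ExecutorService searcherExecutor = Executors.newSingleThreadExecutor();
            CountDownLatch coreClosed = new CountDownLatch(1);

            searcherExecutor.submit(() -> {
              coreClosed.await();
              infoRegistry.put("searcher", new Object()); // late registration, as in the race above
              return null;
            });

            coreClosed.countDown();

            // Proposed order: wait for the searcher executor to finish first...
            searcherExecutor.shutdown();
            searcherExecutor.awaitTermination(5, TimeUnit.SECONDS);

            // ...and only then clear the registry, so the late registration is removed too.
            infoRegistry.clear();
            System.out.println("leaked entries: " + infoRegistry.keySet()); // prints []
          }
        }
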
        Michael Dodsworth added a comment -

        Shalin Shekhar Mangar any feedback on this?

        Shalin Shekhar Mangar added a comment -

        Thanks for pinging me Michael. This issue had been forgotten.

        I now understand the bug and I am able to reproduce it locally. I started with Cyrille's patch which introduced an exception in the SolrCore constructor and I added logging of all items which are added to JMX and all the items that are removed on close after the exception. With a little bit of awk and sort, I have this list of mbeans which are leaked:

        documentCache
        fieldValueCache
        filterCache
        mlt
        perSegFilter
        query
        queryResultCache
        searcher
        Searcher@778e65f2[techproducts]
        
        Cyrille Roy added a comment - - edited

        hi Shalin Shekhar Mangar,
        this is great news.
        Some of the leaking mbeans are visible in the attached mbean-leak-jira.png.
        Also, I have been able to fix the leak with the attached SOLR-2927.patch.
        Do you still see the leaking mbeans after applying the patch?

        The issue is a race condition.
        When opening a core, Solr prevents searchers from firing by submitting a task that blocks the single-threaded executor at line 783:

        searcherExecutor.submit(new Callable() {
          @Override
          public Object call() throws Exception {
            latch.await();
            return null;
          }
        });

        Other events are queued on the same executor at line 864 when the getSearcher(false, false, null, true) method is called.
        getSearcher calls registerSearcher, which calls register, which registers mbeans via the getInfoRegistry().put(key, value) method.

        On exception, the latch is released, so the searcher threads can fire and write entries into the infoRegistry map:

        } catch (Throwable e) {
          latch.countDown(); // release the latch, otherwise we block trying to do the close. This should be fine, since counting down on a latch of 0 is still fine
          // close down the searcher and any other resources, if it exists, as this is not recoverable
          close();

        And indeed that is what happens in close():

        we clear the mbeans at line 990

        try {
          infoRegistry.clear();
        } catch (Throwable e) {
          SolrException.log(log, e);
        }

        and only then wait for searcher thread completion at line 1070

        try {
          ExecutorUtil.shutdownAndAwaitTermination(searcherExecutor);
        } catch (Throwable e) {
          SolrException.log(log, e);
        }

        Putting this before the registry clear actually solves the issue.

        Shalin Shekhar Mangar added a comment -

        I still see two mbeans leaked after moving infoRegistry.clear after the executor has been shut down and the searcher is closed.

        The reason is that two components, MLTQParserPlugin and NestedQParserPlugin, use the same key name as QueryComponent and MLTComponent respectively while registering themselves. JMX is perfectly fine accepting mbeans with the same key because we use the key as the "type" and the mbean name (usually the class name) as the mbean "id" (it should have been the other way around, but I didn't know enough about JMX when I wrote the original impl), but the map loses track of these mbeans and doesn't un-register them. We can't change the name of these parsers because it breaks back-compat.

        In this patch, I used Cyrille's suggestion to move the clear after the searcher is shut down. I also use a JMX query to get all registered mbeans and un-register them. This gets rid of all leaks.

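        For illustration, a minimal sketch of the "query JMX and un-register everything" step described above, using only standard javax.management calls and assuming the core's mbeans live under a JMX domain like "solr/CORE_NAME" (as seen in jconsole earlier in this thread); this is not the committed patch:

        import java.lang.management.ManagementFactory;
        import javax.management.MBeanServer;
        import javax.management.ObjectName;

        public class JmxUnregisterSketch {
          public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Match every mbean registered under the core's domain, regardless of
            // whether the infoRegistry map still remembers it under the same key.
            ObjectName pattern = new ObjectName("solr/CORE_NAME:*");
            for (ObjectName name : server.queryNames(pattern, null)) {
              server.unregisterMBean(name);
            }
          }
        }
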
        Cyrille Roy added a comment -

        great, thanks Shalin Shekhar Mangar.
        I tested the fix I provided against solr-4.2 branch, maybe that's why I did not have the 2 extra leaking mbeans.

        ASF subversion and git services added a comment -

        Commit 1636603 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1636603 ]

        SOLR-2927: Solr does not unregister all mbeans upon exception in constructor causing memory leaks

        ASF subversion and git services added a comment -

        Commit 1636604 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1636604 ]

        SOLR-2927: Solr does not unregister all mbeans upon exception in constructor causing memory leaks

        Shalin Shekhar Mangar added a comment -

        > I tested the fix I provided against solr-4.2 branch, maybe that's why I did not have the 2 extra leaking mbeans.

        Yes, both of those qparsers are newer than 4.2.

        This is fixed in trunk and branch_5x. Thanks Tom, Sharath, Michael and Cyrille!

        Michael Dodsworth added a comment -

        Thanks, Shalin Shekhar Mangar. Much appreciated.

        Cyrille Roy added a comment -

        Thanks a lot Shalin Shekhar Mangar

        Cyrille Roy added a comment -

        I removed the initial patch I provided as Shalin Shekhar Mangar enriched it for branch_5X

        Shalin Shekhar Mangar added a comment -

        > I synced the trunk code and have some failures.
        > I don't know how to check for trunk status other than rebuilding it (I did not find a URL where we can check for trunk status)

        What failures have you seen?

        > I wonder if it would be better to wait for searcherExecutor shutdown sooner rather than later, because this thread can keep on doing some work whereas we are supposed to close the core.

        Once the searcherExecutor is shut down, there won't be any searcher thread doing any work (including warming). The on-deck searcher is also closed before we clear the infoRegistry.

        Cyrille Roy added a comment - - edited

        hi Shalin Shekhar Mangar, I have some questions regarding the patch you provided.
        I backported the fix into branch_4_2 and have some test failures.

        I updated the patch so that, right after the closeHooks (line 1070), we have:

        • the ExecutorUtil.shutdownAndAwaitTermination block
        • the infoRegistry block

        With that, I have no more failures.

        I wonder if it would be better to wait for searcherExecutor shutdown sooner rather than later, because this thread can keep on doing some work whereas we are supposed to close the core.
        Thanks

        Cyrille Roy added a comment -

        [junit4:junit4] Tests with failures (first 10 out of 11):
        [junit4:junit4] - org.apache.solr.MinimalSchemaTest (suite)
        [junit4:junit4] - org.apache.solr.cloud.SyncSliceTest (suite)
        [junit4:junit4] - org.apache.solr.cloud.UnloadDistributedZkTest (suite)
        [junit4:junit4] - org.apache.solr.cloud.AliasIntegrationTest (suite)
        [junit4:junit4] - org.apache.solr.cloud.BasicDistributedZk2Test (suite)
        [junit4:junit4] - org.apache.solr.cloud.FullSolrCloudDistribCmdsTest (suite)
        [junit4:junit4] - org.apache.solr.cloud.CollectionsAPIDistributedZkTest (suite)
        [junit4:junit4] - org.apache.solr.cloud.ShardRoutingTest (suite)
        [junit4:junit4] - org.apache.solr.cloud.RecoveryZkTest (suite)
        [junit4:junit4] - org.apache.solr.handler.admin.ShowFileRequestHandlerTest (suite)
        [junit4:junit4]
        [junit4:junit4]
        [junit4:junit4] JVM J0: 2.84 .. 890.21 = 887.37s
        [junit4:junit4] JVM J1: 2.84 .. 767.84 = 765.00s
        [junit4:junit4] JVM J2: 2.84 .. 881.17 = 878.34s
        [junit4:junit4] JVM J3: 3.09 .. 768.09 = 765.01s
        [junit4:junit4] Execution time total: 14 minutes 50 seconds
        [junit4:junit4] Tests summary: 272 suites, 1148 tests, 11 suite-level errors, 14 ignored (2 assumptions)

        I often see this stack:
        [junit4:junit4] 2> 69753 T293 oasc.JmxMonitoredMap$SolrDynamicMBean.getMBeanInfo WARNING Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed

        This is only a warning, but it comes from a class updated by the patch, so I am mentioning it.

        Cyrille Roy added a comment - - edited

        The deadlock in our 4_2 branch occurs when doing infoRegistry.clear() after closeSearcher().
        I'll integrate the patch taking this into account.
        Thanks again Shalin Shekhar Mangar for the great work and feedback.

        Shalin Shekhar Mangar added a comment -

        Reopening to backport in 4.10.3

        ASF subversion and git services added a comment -

        Commit 1642728 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1642728 ]

        SOLR-2927: Solr does not unregister all mbeans upon exception in constructor causing memory leaks

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: tom liu
          • Votes: 1
          • Watchers: 7
