Solr / SOLR-9330

Race condition between core reload and statistics request

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5
    • Fix Version/s: 6.3, master (7.0)
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      It happens that we execute these two requests consecutively in Solr 5.5:

      • Core reload: /admin/cores?action=RELOAD&core=coreName
      • Check core statistics: /coreName/admin/mbeans?stats=true

      Sometimes the second request ends with this error:

      
      ERROR org.apache.solr.servlet.HttpSolrCall - null:org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
      	at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:274)
      	at org.apache.lucene.index.StandardDirectoryReader.getVersion(StandardDirectoryReader.java:331)
      	at org.apache.lucene.index.FilterDirectoryReader.getVersion(FilterDirectoryReader.java:119)
      	at org.apache.lucene.index.FilterDirectoryReader.getVersion(FilterDirectoryReader.java:119)
      	at org.apache.solr.search.SolrIndexSearcher.getStatistics(SolrIndexSearcher.java:2404)
      	at org.apache.solr.handler.admin.SolrInfoMBeanHandler.addMBean(SolrInfoMBeanHandler.java:164)
      	at org.apache.solr.handler.admin.SolrInfoMBeanHandler.getMBeanInfo(SolrInfoMBeanHandler.java:134)
      	at org.apache.solr.handler.admin.SolrInfoMBeanHandler.handleRequestBody(SolrInfoMBeanHandler.java:65)
      	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
      	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2082)
      	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:670)
      	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:458)
      	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
      	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
      
      

      If my understanding of SolrCore internals is correct, this happens because of the async nature of the reload request:

      • A new searcher is "registered" in a separate thread.
      • The old searcher is closed in that same thread, and only after the new one is registered.
      • When the old searcher is closing, it removes itself from the MBeans map.
      • If a statistics request arrives before the old searcher is completely removed from everywhere, this exception can happen.

      What do you think about introducing a new parameter for the reload request which makes it fully synchronized? Basically it would force it to call

        SolrCore#getSearcher(boolean forceNew, boolean returnSearcher, final Future[] waitSearcher, boolean updateHandlerReopens)

      with waitSearcher != null.
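
      For illustration, a minimal sketch of that synchronous path (the reloadAndWait helper is hypothetical; only the waitSearcher handling follows the getSearcher contract quoted above):

        import java.util.concurrent.Future;

        import org.apache.solr.core.SolrCore;

        // Hypothetical helper: block until the new searcher is registered.
        // With a non-null waitSearcher array, getSearcher() hands back a Future
        // that completes once the new searcher is registered and the old one
        // can be closed.
        @SuppressWarnings("rawtypes")
        static void reloadAndWait(SolrCore core) throws Exception {
          final Future[] waitSearcher = new Future[1];
          core.getSearcher(true /* forceNew */, false /* returnSearcher */,
              waitSearcher, true /* updateHandlerReopens */);
          if (waitSearcher[0] != null) {
            waitSearcher[0].get(); // returns only after registration completes
          }
        }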

      Attachments

      1. SOLR-9330.patch
        6 kB
        Mike Drob
      2. SOLR-9390.patch
        9 kB
        Mikhail Khludnev
      3. SOLR-9390.patch
        7 kB
        Andrey Kudryavtsev
      4. SOLR-9390.patch
        9 kB
        Mikhail Khludnev
      5. SOLR-9390.patch
        6 kB
        Mikhail Khludnev
      6. too_sync.patch
        11 kB
        Andrey Kudryavtsev

          Activity

          joel.bernstein Joel Bernstein added a comment - edited

          We are currently looking at exactly this issue with Alfresco's Solr integration. So we'll be happy to help find the right fix for this.

          mdrob Mike Drob added a comment

          I ran into a similar problem recently. It can also happen on shutdown or during regular core closing.

          Is there a reason that getVersion needs to ensureOpen first? I noticed that numDocs and maxDoc don't do this (a comment claims it's for performance reasons), and they are called earlier in the same getStatistics method.

          If we can't remove ensureOpen from StandardDirectoryReader::getVersion, then we can also work around this by checking it ourselves. That would look like:

          SolrIndexSearcher.java
          -    lst.add("indexVersion", reader.getVersion());
          +    if (reader.getRefCount() > 0) lst.add("indexVersion", reader.getVersion());
          

          I don't know what the full effects of missing an indexVersion field in the JMX stats will be, but I am fairly confident that it will be less damaging than the currently thrown Exception.

          Joel Bernstein - Do you have any thoughts on which approach to take? Happy to put up a patch for either one.

          mkhludnev Mikhail Khludnev added a comment

          Here is a straightforward test reproducing the race; it catches the exception at least every third run on my laptop. It might still be rough, nevertheless.
          I wonder if I need to mark it @Nightly if it runs for 10 sec?
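
          Roughly, it does something like the following (an illustrative sketch only; the class name and details here are made up, the attached patch is authoritative):

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;
            import java.util.concurrent.TimeUnit;

            import org.apache.solr.SolrTestCaseJ4;
            import org.junit.BeforeClass;
            import org.junit.Test;

            // Illustrative sketch: reload the core in one thread while another keeps
            // requesting mbeans stats; without a fix this trips AlreadyClosedException.
            public class TestReloadVsStatsRace extends SolrTestCaseJ4 {

              @BeforeClass
              public static void beforeTests() throws Exception {
                initCore("solrconfig.xml", "schema.xml");
              }

              @Test
              public void testStatsDuringReload() throws Exception {
                final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
                ExecutorService pool = Executors.newFixedThreadPool(2);
                Future<?> reloads = pool.submit(() -> {
                  while (System.nanoTime() < deadline) {
                    h.getCoreContainer().reload("collection1"); // searcher swap is async inside
                  }
                });
                Future<?> stats = pool.submit(() -> {
                  while (System.nanoTime() < deadline) {
                    try {
                      h.query(req("qt", "/admin/mbeans", "stats", "true"));
                    } catch (Exception e) {
                      throw new RuntimeException(e); // AlreadyClosedException surfaces here
                    }
                  }
                });
                reloads.get(); // propagate failures from either thread
                stats.get();
                pool.shutdownNow();
              }
            }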

          mkhludnev Mikhail Khludnev added a comment

          I made a few common-sense/best-effort changes, but nothing besides try{}catch() helps. Perhaps it might be synchronized properly with solrCore._searcherLock, but that's too many changes, I suppose.

          I'll try to just cache reader.version, or the whole reader stats, since they shouldn't change during the searcher's life (will check).
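
          For instance, a minimal sketch of the caching idea (class and field names here are just for illustration; the patch may shape it differently):

            import org.apache.lucene.index.DirectoryReader;
            import org.apache.solr.common.util.NamedList;
            import org.apache.solr.common.util.SimpleOrderedMap;

            // Sketch: snapshot the reader-derived stats once, while the reader is
            // guaranteed open (e.g. at SolrIndexSearcher construction time), so
            // later getStatistics() calls never touch a possibly-closed reader.
            class SearcherStats {
              private final NamedList<Object> readerStats;

              SearcherStats(DirectoryReader reader) {
                NamedList<Object> lst = new SimpleOrderedMap<>();
                lst.add("numDocs", reader.numDocs());
                lst.add("maxDoc", reader.maxDoc());
                lst.add("indexVersion", reader.getVersion()); // reader is open here
                this.readerStats = lst;
              }

              NamedList<Object> getStatistics() {
                return readerStats; // immutable snapshot; no live reader access
              }
            }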

          mdrob Mike Drob added a comment

          Mikhail Khludnev - that looks like a good start for a test. I do not think 10 seconds is long enough to relegate something to @Nightly runs.

          I'm putting up an alternative fix that also seems to make your test pass. I think I like my approach better because it may still return partial results when able, but I'm also not convinced that there aren't other handlers suffering from similar problems that are still unaddressed. Your approach may be able to fix everything at once.

          Maybe the best solution is to do both things.

          werder Andrey Kudryavtsev added a comment - edited

          If we are talking about the root cause of the race, it looks like things are a little simpler than I thought. There is just another thread executor in CoreAdminHandler. If there were an API to avoid it, the reload operation would become more synchronous (patch is attached); a sketch of the idea follows.

          I thought it would require doing something like too_sync.patch, but probably not this time.
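
          Schematically (an illustrative sketch of the idea, not the actual patch; the sync flag and helper are made up):

            import java.util.concurrent.ExecutorService;

            import org.apache.solr.core.CoreContainer;

            // Hypothetical helper: let callers run the reload on the request thread
            // instead of handing it to the handler's internal executor.
            static void reload(CoreContainer cores, String coreName,
                               ExecutorService executor, boolean sync) {
              if (sync) {
                cores.reload(coreName);                        // caller blocks until reload returns
              } else {
                executor.submit(() -> cores.reload(coreName)); // deferred, as today
              }
            }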

          mkhludnev Mikhail Khludnev added a comment

          Mike Drob
          my concern regarding

            if (reader.getRefCount() > 0) {
              // and here the core is reloaded in parallel and closes the reader;
              // I think it can be reproduced with a sleep()
              lst.add("indexVersion", reader.getVersion());
            }
          

          Can you also share any observations regarding the 'jmx exception on shutdown', or perhaps you have a test? Is it the same case, or something different?

          mdrob Mike Drob added a comment

          You're right about there still being a race condition in my solution. For some reason I wasn't thinking about what happens in that case.

          Re: the jmx exception - yeah, it's the same thing. We have a custom JMX handler on top of the mbean endpoint that has run into this same issue. I should have cleaned up the commit message there.

          Andrey Kudryavtsev, the reload operation isn't the only place where this can throw exceptions - it can also happen during core delete or during shutdown. So if we add synchronization, we will need to look at CoreContainer::shutdown, SolrCores::close, etc. I don't have tests written for all of these, but it's a nearly identical stack trace, given that reload is essentially unload + load.

          werder Andrey Kudryavtsev added a comment

          Mike Drob,

          Of course, if you are going to execute reload() and getStatistics(), or deleteCore() and getStatistics(), in two separate (client) threads, you might still have trouble, and solving these races would require additional internal (probably complicated) synchronization (which, by the way, would affect all the other cases where you don't really care about racy executions).

          But in my case I can do this synchronization on the client side; I mean, just don't send statistics requests while the core is still reloading (I don't care about the reloading core's statistics). The only thing I need for this is to know that reloading is still in progress - some ability to make reload() a fully synchronized operation. Does anybody know why it is async by default, by any chance?

          mkhludnev Mikhail Khludnev added a comment

          The best way to win the race is to avoid it. Please check SOLR-9390.patch and raise any concerns. I cut down the sweeping fixes, but left one useless note commented out (regarding removing from infoRegistry); let me know if you disagree with keeping it in the source. I think it's ready to be committed this week.

          mkhludnev Mikhail Khludnev added a comment

          Mike Drob, would you mind reviewing the last patch?

          mdrob Mike Drob added a comment

              lst.add("searcherName", name);
              lst.add("caching", cachingEnabled);
              lst.add("openedAt", openTime);
              if (registerTime != null) lst.add("registeredAt", registerTime);
              lst.add("warmupTime", warmupTime);
          

          Why not put these in the cached list as well? The first three are final and available before your call to snapStatistics. The last two are set during register, which should only be called once, if I understand this correctly. Then the whole method becomes return readerStats; - much simpler and probably faster too!
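
          i.e., something like this (a sketch, building on the readerStats snapshot from the patch, and assuming register() is only called once):

            public void register() {
              // ... existing registration work ...
              readerStats.add("registeredAt", registerTime);
              readerStats.add("warmupTime", warmupTime);
            }

            public NamedList<Object> getStatistics() {
              return readerStats; // plain field read; nothing left to throw
            }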

          +    // core.getInfoRegistry().remove(STATISTICS_KEY, this);
          +    // decided to comment it, because it might upset users by showing stats, w/o "searcher" entry
          

          I don't think there is any reason to keep this in.

          Other than those minor points, the patch looks good to me.

          I've had a similar issue when calling /replication?command=details, but am not able to reproduce it in this test, so I think we're fine to handle that later.

          mdrob Mike Drob added a comment

          Ah, I see where the difference is, yes. In my case, the client process getting the statistics is an external monitoring application that fetches them every 15 seconds and charts them. Since the number of replicas can move, grow, and shrink to accommodate usage, solving races like this is a very complicated problem. And at the end of the day, I don't care if my monitoring system misses one round of statistics; I'm more concerned about scary exceptions in the log that the ops team has to deal with.

          jira-bot ASF subversion and git services added a comment

          Commit b50b9106f821915ced14a3efc1e09c265d648fa8 in lucene-solr's branch refs/heads/master from Mikhail Khludnev
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b50b910 ]

          SOLR-9330: Fix AlreadyClosedException on admin/mbeans?stats=true

          jira-bot ASF subversion and git services added a comment

          Commit 4278356564d4928dfabfde6d4a2ebf57dcfcf81f in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4278356 ]

          SOLR-9330: Fix AlreadyClosedException on admin/mbeans?stats=true

          mkhludnev Mikhail Khludnev added a comment

          The exception should be gone. Sometimes the response contains no "searcher" entry, but only something like Searcher@deadbeaf [collection1] main. So, clients consuming the stats response might be surprised, but shouldn't be.

          shalinmangar Shalin Shekhar Mangar added a comment

          Closing after 6.3.0 release.


            People

            • Assignee: mkhludnev Mikhail Khludnev
            • Reporter: werder Andrey Kudryavtsev
            • Votes: 0
            • Watchers: 10
