Solr
  1. Solr
  2. SOLR-7408

Let SolrCore be the only thing which registers/unregisters a config directory listener

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: SolrCloud
    • Labels:
      None
    • Flags:
      Patch

      Description

      As reported here: http://markmail.org/message/ynkm2axkdprppgef, there is a race condition which results in an exception when creating multiple collections over the same config set. I was able to reproduce it in a test, although I am only able to reproduce if I put break points and manually simulate the problematic context switches.

      1. SOLR-7408.patch
        47 kB
        Shai Erera
      2. SOLR-7408.patch
        48 kB
        Shai Erera
      3. SOLR-7408.patch
        48 kB
        Shai Erera
      4. SOLR-7408.patch
        46 kB
        Shai Erera
      5. SOLR-7408.patch
        44 kB
        Shai Erera
      6. SOLR-7408.patch
        15 kB
        Shai Erera
      7. SOLR-7408.patch
        13 kB
        Shai Erera

        Activity

        Hide
        Shai Erera added a comment -

        The patch adds a test which reproduces the bug, however only if you place breakpoints and simulate the context switches manually. Therefore I'll explain the steps to reproduce here, and we can decide if we want to keep the test or remove it:

        • Put a breakpoint in ZkController.registerConfListenerForCore on synchronized (confDirectoryListeners) {.
        • Put another breakpoint in ZkController.unregister on synchronized (confDirectoryListeners) {.
        • Run the test in Debug and wait for the two threads to get to registerConfListenerForCore
          • Advance one of them until it reaches the other breakpoint in unregister.
          • Let it remove the entry from the map.
          • Now advance the second thread that's in registerConfigListenerForCore, it doesn't find the confDir in the map and throws the "conf directory is not valid" exception.

        What happens is that the last core which references a configDir is removed, at the same time a new core is created. Note that even if both threads call .watchZKConfDir, we still remove the entry from the map, and the current code fails w/ the exception.

        This patch partially fixes the problem, by moving the removal of the entry from the map into a sync'd block as well, guards the map in watchZKConfDir and re-registers a watch inside registerConfListenerForCore if one doesn't exist.

        However, this doesn't solve the entire problem. While now the application won't hit an exception, the above chain of events can cause the listener to be removed, after it has been registered. The reason is that SolrCore registers the listener in its constructor, however CoreContainer.getCores() (which is used by ZkController.getCores()) does not return that SolrCore until it has been registered in SolrCores.cores. That happens in CoreContainer.registerCore, line 460, after the SolrCore was already created and registered its listener.

        I think that the solution is to register the core listener only when we are ready to advertise that core, i.e. not until its put in SolrCore.cores. That is the only way to guarantee that ZkController.unregister will remove the confDir listener when all known/advertised cores don't reference it. And since registerConfListenerForCore re-adds the watch if one doesn't exist, we should be safe.

        However, this area of the code is completely new to me, so would appreciate a review. You don't really have to run the test, it only helped me understand what's going on.

        Show
        Shai Erera added a comment - The patch adds a test which reproduces the bug, however only if you place breakpoints and simulate the context switches manually. Therefore I'll explain the steps to reproduce here, and we can decide if we want to keep the test or remove it: Put a breakpoint in ZkController.registerConfListenerForCore on synchronized (confDirectoryListeners) { . Put another breakpoint in ZkController.unregister on synchronized (confDirectoryListeners) { . Run the test in Debug and wait for the two threads to get to registerConfListenerForCore Advance one of them until it reaches the other breakpoint in unregister . Let it remove the entry from the map. Now advance the second thread that's in registerConfigListenerForCore , it doesn't find the confDir in the map and throws the "conf directory is not valid" exception. What happens is that the last core which references a configDir is removed, at the same time a new core is created. Note that even if both threads call .watchZKConfDir , we still remove the entry from the map, and the current code fails w/ the exception. This patch partially fixes the problem, by moving the removal of the entry from the map into a sync'd block as well, guards the map in watchZKConfDir and re-registers a watch inside registerConfListenerForCore if one doesn't exist. However, this doesn't solve the entire problem. While now the application won't hit an exception, the above chain of events can cause the listener to be removed, after it has been registered. The reason is that SolrCore registers the listener in its constructor, however CoreContainer.getCores() (which is used by ZkController.getCores() ) does not return that SolrCore until it has been registered in SolrCores.cores . That happens in CoreContainer.registerCore , line 460, after the SolrCore was already created and registered its listener. I think that the solution is to register the core listener only when we are ready to advertise that core, i.e. not until its put in SolrCore.cores . That is the only way to guarantee that ZkController.unregister will remove the confDir listener when all known/advertised cores don't reference it. And since registerConfListenerForCore re-adds the watch if one doesn't exist, we should be safe. However, this area of the code is completely new to me, so would appreciate a review. You don't really have to run the test, it only helped me understand what's going on.
        Hide
        Shai Erera added a comment -

        Patch implements what I proposed: SorlCore.registerConfListener() is made public, no longer called by SolrCore() and now called by SolrCores.putCore after it is put in the map. This now prevents removing any listeners of known/advertised cores.

        Ideally, we would not create an entry in the map unless there is at least one listener for this confDir, and we would also remove the entry from the map when the last listener has been removed (when the last SolrCore is closed). However, it seems like we call unregister even in the event of errors (CoreAdminHandler.handleCreateAction), forcing the removal of the configDir listener, and it's not clear if we also call SolrCore.close(). So we may start leaking listeners that will never be removed ...

        If anyone has a better idea how to address that issue, please chime in.

        Show
        Shai Erera added a comment - Patch implements what I proposed: SorlCore.registerConfListener() is made public, no longer called by SolrCore() and now called by SolrCores.putCore after it is put in the map. This now prevents removing any listeners of known/advertised cores. Ideally, we would not create an entry in the map unless there is at least one listener for this confDir, and we would also remove the entry from the map when the last listener has been removed (when the last SolrCore is closed). However, it seems like we call unregister even in the event of errors ( CoreAdminHandler.handleCreateAction ), forcing the removal of the configDir listener, and it's not clear if we also call SolrCore.close() . So we may start leaking listeners that will never be removed ... If anyone has a better idea how to address that issue, please chime in.
        Hide
        Anshum Gupta added a comment - - edited

        Thanks Shai. The patch looks good to me in what it does.

        Though I think I understand what you're saying here, can you elaborate more on this?

        However, it seems like we call unregister even in the event of errors (CoreAdminHandler.handleCreateAction), forcing the removal of the configDir listener, and it's not clear if we also call SolrCore.close(). So we may start leaking listeners that will never be removed ..

        Show
        Anshum Gupta added a comment - - edited Thanks Shai. The patch looks good to me in what it does. Though I think I understand what you're saying here, can you elaborate more on this? However, it seems like we call unregister even in the event of errors (CoreAdminHandler.handleCreateAction), forcing the removal of the configDir listener, and it's not clear if we also call SolrCore.close(). So we may start leaking listeners that will never be removed ..
        Hide
        Shai Erera added a comment -

        Though I think I understand what you're saying here, can you elaborate more on this?

        If we wanted to change the code such that we put a listener in the map on a SolrCore creation, and remove it from the map on a SolrCore close, I believe we wouldn't be running into such concurrency issues. In a sense, this is what is done when all is good: a SolrCore puts a listener in its ctor, and removes it in its close().

        But if something goes wrong, we may leave dangling listeners, of SolrCore instances that no longer exist. This is what I believe (CoreAdminHandler.handleCreateAction attempts to do – if a core creation failed, it attempts to unregister all listeners of a configDir from the map, and lets unregister decide if the entry itself can be removed or not. This ensures that we won't be left w/ dangling listeners that will never be released - what I referred to as leaking listeners.

        The code in unregister relies on the same logic that introduces the bug – if there is core in SolrCores which references this configDir, remove all listeners. The problem is that a core registers a listener, before it is put in SolrCores, and hence the race condition.

        I would personally prefer that we stop removing all listeners, and let a core take care of itself, but I don't know how safe is Solr code in that regard. I.e. are all places that create a SolrCore clean up after it in the event of a failure? Clearly CoreAdminHandler.handleCreateAction doesn't, which got me thinking what other places don't do that as well.

        But, if we want to change the logic like that, we can certainly look at all the places that do new SolrCore(...) and make sure they call SolrCore.close() in the event of any failure.

        Show
        Shai Erera added a comment - Though I think I understand what you're saying here, can you elaborate more on this? If we wanted to change the code such that we put a listener in the map on a SolrCore creation, and remove it from the map on a SolrCore close, I believe we wouldn't be running into such concurrency issues. In a sense, this is what is done when all is good : a SolrCore puts a listener in its ctor, and removes it in its close(). But if something goes wrong , we may leave dangling listeners, of SolrCore instances that no longer exist. This is what I believe ( CoreAdminHandler.handleCreateAction attempts to do – if a core creation failed, it attempts to unregister all listeners of a configDir from the map, and lets unregister decide if the entry itself can be removed or not. This ensures that we won't be left w/ dangling listeners that will never be released - what I referred to as leaking listeners. The code in unregister relies on the same logic that introduces the bug – if there is core in SolrCores which references this configDir, remove all listeners. The problem is that a core registers a listener, before it is put in SolrCores, and hence the race condition. I would personally prefer that we stop removing all listeners, and let a core take care of itself, but I don't know how safe is Solr code in that regard. I.e. are all places that create a SolrCore clean up after it in the event of a failure? Clearly CoreAdminHandler.handleCreateAction doesn't, which got me thinking what other places don't do that as well. But, if we want to change the logic like that, we can certainly look at all the places that do new SolrCore(...) and make sure they call SolrCore.close() in the event of any failure.
        Hide
        Noble Paul added a comment -

        looks good to me

        Show
        Noble Paul added a comment - looks good to me
        Hide
        Anshum Gupta added a comment -

        I like the idea of calling SolrCore.close() and letting that handle the responsibility of unregistering (already happens).
        Does it make more sense to have this in a different JIRA or at least change the title/summary of this one to highlight the new/updated intention?

        Show
        Anshum Gupta added a comment - I like the idea of calling SolrCore.close() and letting that handle the responsibility of unregistering (already happens). Does it make more sense to have this in a different JIRA or at least change the title/summary of this one to highlight the new/updated intention?
        Hide
        Shai Erera added a comment -

        I will update the title of this JIRA and handle it here. I like this better than doing what I consider more of a hack to the code and later change it. SolrCore is initialized in two places, so shouldn't be complicated to ensure it is closed in case of errors.

        While I'm at it, I'll try to simplify the ctor by breaking it out to some auxiliary methods, instead of having a 250 lines method!

        Show
        Shai Erera added a comment - I will update the title of this JIRA and handle it here. I like this better than doing what I consider more of a hack to the code and later change it. SolrCore is initialized in two places, so shouldn't be complicated to ensure it is closed in case of errors. While I'm at it, I'll try to simplify the ctor by breaking it out to some auxiliary methods, instead of having a 250 lines method!
        Hide
        Anshum Gupta added a comment -

        +1 to both of those!

        Show
        Anshum Gupta added a comment - +1 to both of those!
        Hide
        Shai Erera added a comment -

        There are two ctors which aren't called by Solr code. Is SolrCore considered public API, and so we should keep these ctors, or can I remove them?

        Show
        Shai Erera added a comment - There are two ctors which aren't called by Solr code. Is SolrCore considered public API, and so we should keep these ctors, or can I remove them?
        Hide
        Shalin Shekhar Mangar added a comment -

        There are two ctors which aren't called by Solr code. Is SolrCore considered public API, and so we should keep these ctors, or can I remove them?

        Yes, it is public API because of EmbeddedSolrServer use-cases. We can deprecate them but we can't remove them. I will be on board if you want to remove the deprecated APIs after 1 minor release though.

        Show
        Shalin Shekhar Mangar added a comment - There are two ctors which aren't called by Solr code. Is SolrCore considered public API, and so we should keep these ctors, or can I remove them? Yes, it is public API because of EmbeddedSolrServer use-cases. We can deprecate them but we can't remove them. I will be on board if you want to remove the deprecated APIs after 1 minor release though.
        Hide
        Shalin Shekhar Mangar added a comment -

        I would personally prefer that we stop removing all listeners, and let a core take care of itself, but I don't know how safe is Solr code in that regard. I.e. are all places that create a SolrCore clean up after it in the event of a failure? Clearly CoreAdminHandler.handleCreateAction doesn't, which got me thinking what other places don't do that as well.

        Should we start ref counting the ZK watches?

        Show
        Shalin Shekhar Mangar added a comment - I would personally prefer that we stop removing all listeners, and let a core take care of itself, but I don't know how safe is Solr code in that regard. I.e. are all places that create a SolrCore clean up after it in the event of a failure? Clearly CoreAdminHandler.handleCreateAction doesn't, which got me thinking what other places don't do that as well. Should we start ref counting the ZK watches?
        Hide
        Shai Erera added a comment -

        Should we start ref counting the ZK watches?

        The confDir listeners are per SolrCore instance, so I don't think we should ref-count them. Each SolrCore will add/remove its own listener.

        Show
        Shai Erera added a comment - Should we start ref counting the ZK watches? The confDir listeners are per SolrCore instance, so I don't think we should ref-count them. Each SolrCore will add/remove its own listener.
        Hide
        Shai Erera added a comment -
        • Remove ZkController.watchZkDir
        • Remove code which cleaned the map in .unregister()
        • Ensures SolrCore is closed on any errors
        • Simplifies SolrCore ctor a bit by extracting logic to separate methods. More can be done I'm sure, by breaking out this class into several auxiliary classes, but not in the context of this issue.

        Tests pass.

        Show
        Shai Erera added a comment - Remove ZkController.watchZkDir Remove code which cleaned the map in .unregister() Ensures SolrCore is closed on any errors Simplifies SolrCore ctor a bit by extracting logic to separate methods. More can be done I'm sure, by breaking out this class into several auxiliary classes, but not in the context of this issue. Tests pass.
        Hide
        Shai Erera added a comment -

        If there are no objections, I will commit this tomorrow.

        Show
        Shai Erera added a comment - If there are no objections, I will commit this tomorrow.
        Hide
        Anshum Gupta added a comment -

        Shai, can you give me a just a few hours? I'd want to look at this change. I wanted to get to it earlier today but didn't get any time.

        Show
        Anshum Gupta added a comment - Shai, can you give me a just a few hours? I'd want to look at this change. I wanted to get to it earlier today but didn't get any time.
        Hide
        Shai Erera added a comment -

        Sure!

        Show
        Shai Erera added a comment - Sure!
        Hide
        Anshum Gupta added a comment - - edited

        Thanks for doing all of the cleanup while fixing the race condition! LGTM other than these points:

        • You have a nocommit about removal of the constructors, I'd say we should deprecate them for one release (at least).
        • Can you extract zkController.watchZKConfDir() back out, I think you moved it inline but it keeps things cleaner and makes more sense to have that as a method.
        • Also, shouldn't zkController.getConfigDirListener.OnReconnect() reset the watches on zk i.e. have a call to watchZKConfDir() added back?
        • configLocation is no longer used in unregister, so you might want to remove that too.
        Show
        Anshum Gupta added a comment - - edited Thanks for doing all of the cleanup while fixing the race condition! LGTM other than these points: You have a nocommit about removal of the constructors, I'd say we should deprecate them for one release (at least). Can you extract zkController.watchZKConfDir() back out, I think you moved it inline but it keeps things cleaner and makes more sense to have that as a method. Also, shouldn't zkController.getConfigDirListener.OnReconnect() reset the watches on zk i.e. have a call to watchZKConfDir() added back? configLocation is no longer used in unregister, so you might want to remove that too.
        Hide
        Shai Erera added a comment -

        Thanks Anshum. Good catch on OnReconnect()! I applied all your comments.

        Show
        Shai Erera added a comment - Thanks Anshum. Good catch on OnReconnect()! I applied all your comments.
        Hide
        Anshum Gupta added a comment -

        One last thing and it's a +1 from my end.

        Can you also remove the check for collection == null in the line below?That check would never be false due to the assert right above.
        It's not something that you changed but I think it'd be good to just clean that up.

        public void unregister(String coreName, CoreDescriptor cd) throws InterruptedException, KeeperException {
            final String coreNodeName = cd.getCloudDescriptor().getCoreNodeName();
            final String collection = cd.getCloudDescriptor().getCollectionName();
            assert collection != null;
        
            if (collection == null || collection.trim().length() == 0) {
              log.error("No collection was specified.");
        
        Show
        Anshum Gupta added a comment - One last thing and it's a +1 from my end. Can you also remove the check for collection == null in the line below?That check would never be false due to the assert right above. It's not something that you changed but I think it'd be good to just clean that up. public void unregister( String coreName, CoreDescriptor cd) throws InterruptedException, KeeperException { final String coreNodeName = cd.getCloudDescriptor().getCoreNodeName(); final String collection = cd.getCloudDescriptor().getCollectionName(); assert collection != null ; if (collection == null || collection.trim().length() == 0) { log.error( "No collection was specified." );
        Hide
        Anshum Gupta added a comment -

        Also, as the test that you've added is only useful when we manually add breakpoints and debug, I guess it's a good idea to disable the test with that comment and not increase the time taken to really run the tests on Jenkins. What do you think?

        Show
        Anshum Gupta added a comment - Also, as the test that you've added is only useful when we manually add breakpoints and debug, I guess it's a good idea to disable the test with that comment and not increase the time taken to really run the tests on Jenkins. What do you think?
        Hide
        Shai Erera added a comment -

        I think that removing that check is dangerous since someone who doesn't run w/ assertions enabled (i.e. in production) will hit an NPE, rather than see this error message. I would rather remove the assert and change the code as follows:

            if (collection == null || collection.trim().length() == 0) {
              log.error("No collection was specified.");
              assert false : "No collection was specified: [" + collection + "]";
              return;
            }
        

        About removing the test, I understand it seems like a waste of time to run a 30 seconds test that now seems to be fixed. But if we remove it, and a bug creeps back into the code, we'd never know as we don't test this case anywhere else. I prefer that we mark it (and the other test in that class as @Nightly).

        Show
        Shai Erera added a comment - I think that removing that check is dangerous since someone who doesn't run w/ assertions enabled (i.e. in production) will hit an NPE, rather than see this error message. I would rather remove the assert and change the code as follows: if (collection == null || collection.trim().length() == 0) { log.error( "No collection was specified." ); assert false : "No collection was specified: [" + collection + "]" ; return ; } About removing the test, I understand it seems like a waste of time to run a 30 seconds test that now seems to be fixed. But if we remove it, and a bug creeps back into the code, we'd never know as we don't test this case anywhere else. I prefer that we mark it (and the other test in that class as @Nightly).
        Hide
        Shai Erera added a comment -

        Marked tests as @Nightly and changed the assertion. BTW, I also noticed there is a potential bug: we check that collection.trim().length() isn't empty, but later we use collection as is. Thinking leading/trailing whitespaces in collection name doesn't make sense, I added code to CloudDescriptor to always call trim().

        Show
        Shai Erera added a comment - Marked tests as @Nightly and changed the assertion. BTW, I also noticed there is a potential bug: we check that collection.trim().length() isn't empty, but later we use collection as is. Thinking leading/trailing whitespaces in collection name doesn't make sense, I added code to CloudDescriptor to always call trim().
        Hide
        Shai Erera added a comment -

        @Nightly annotation moved to class-level.

        Show
        Shai Erera added a comment - @Nightly annotation moved to class-level.
        Hide
        Anshum Gupta added a comment -

        That code change looks fine to me and +1 on the Nightly, I certainly don't want the test removed.
        Let's not add the trim in there as it's not done anywhere else and it's a fair assumption that we wouldn't get multiple whitespaces for a collection name.

        Show
        Anshum Gupta added a comment - That code change looks fine to me and +1 on the Nightly, I certainly don't want the test removed. Let's not add the trim in there as it's not done anywhere else and it's a fair assumption that we wouldn't get multiple whitespaces for a collection name.
        Hide
        Shai Erera added a comment -

        removed .trim() call in CloudDescriptor and also in ZkController.unregister() so it's consistent with the rest of the method.

        Show
        Shai Erera added a comment - removed .trim() call in CloudDescriptor and also in ZkController.unregister() so it's consistent with the rest of the method.
        Hide
        Mike Drob added a comment -

        Also, as the test that you've added is only useful when we manually add breakpoints and debug, I guess it's a good idea to disable the test with that comment and not increase the time taken to really run the tests on Jenkins. What do you think?

        It might be possible to reproduce the context switching/breakpoint behaviour by inserting several countdown latches into the test. This might end up being more trouble than it's worth, but attempting to reliably trigger race conditions can be nearly impossible otherwise.

        Show
        Mike Drob added a comment - Also, as the test that you've added is only useful when we manually add breakpoints and debug, I guess it's a good idea to disable the test with that comment and not increase the time taken to really run the tests on Jenkins. What do you think? It might be possible to reproduce the context switching/breakpoint behaviour by inserting several countdown latches into the test. This might end up being more trouble than it's worth, but attempting to reliably trigger race conditions can be nearly impossible otherwise.
        Hide
        Anshum Gupta added a comment -

        Thanks Shai! Here's my +1 again.

        Mike Drob Thanks for pointing that out and it might be worth the effort but I'd suggest we commit this as a nightly test and then look at simulating the breakpoints by inserting the latches.

        Show
        Anshum Gupta added a comment - Thanks Shai! Here's my +1 again. Mike Drob Thanks for pointing that out and it might be worth the effort but I'd suggest we commit this as a nightly test and then look at simulating the breakpoints by inserting the latches.
        Hide
        ASF subversion and git services added a comment -

        Commit 1675274 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1675274 ]

        SOLR-7408: Let SolrCore be the only thing which registers/unregisters a config directory listener

        Show
        ASF subversion and git services added a comment - Commit 1675274 from Shai Erera in branch 'dev/trunk' [ https://svn.apache.org/r1675274 ] SOLR-7408 : Let SolrCore be the only thing which registers/unregisters a config directory listener
        Hide
        ASF subversion and git services added a comment -

        Commit 1675301 from Shai Erera in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1675301 ]

        SOLR-7408: Let SolrCore be the only thing which registers/unregisters a config directory listener

        Show
        ASF subversion and git services added a comment - Commit 1675301 from Shai Erera in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1675301 ] SOLR-7408 : Let SolrCore be the only thing which registers/unregisters a config directory listener
        Hide
        Shai Erera added a comment -

        Committed to trunk and 5x. Thanks Anshum for the feedback!

        Show
        Shai Erera added a comment - Committed to trunk and 5x. Thanks Anshum for the feedback!
        Hide
        ASF subversion and git services added a comment -

        Commit 1682570 from hossman@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1682570 ]

        SOLR-7603: more test tweaks to protect ourselves from unexpected log levels in tests like the one introduced by SOLR-7408

        Show
        ASF subversion and git services added a comment - Commit 1682570 from hossman@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1682570 ] SOLR-7603 : more test tweaks to protect ourselves from unexpected log levels in tests like the one introduced by SOLR-7408
        Hide
        ASF subversion and git services added a comment -

        Commit 1682571 from hossman@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1682571 ]

        SOLR-7603: more test tweaks to protect ourselves from unexpected log levels in tests like the one introduced by SOLR-7408 (merge r1682570)

        Show
        ASF subversion and git services added a comment - Commit 1682571 from hossman@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1682571 ] SOLR-7603 : more test tweaks to protect ourselves from unexpected log levels in tests like the one introduced by SOLR-7408 (merge r1682570)
        Hide
        Anshum Gupta added a comment -

        Bulk close for 5.2.0.

        Show
        Anshum Gupta added a comment - Bulk close for 5.2.0.

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development