Solr / SOLR-8777

Duplicate Solr process can cripple a running process

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.3.1
    • Fix Version/s: 6.2, master (7.0)
    • Component/s: SolrCloud
    • Labels: None

      Description

      Thanks to Jessica Cheng Mallet for catching this one.

      Accidentally starting the same Solr instance twice causes the second process to die with "Address already in use", but not before it deletes the first instance's live_node entry and logs "Found a previous node that still exists while trying to register a new live node <node> - removing existing node to create another".

      When the second process dies, its own ephemeral node is removed as well. /live_nodes/<node> is then missing even though the first instance is still running, because the first instance's live_node entry was already deleted by the second.
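
The sequence described above can be illustrated with a toy model. This is only a sketch: it assumes ephemeral nodes can be modelled as a map from path to owning session id, and the class name, method names, session ids, and node name are all hypothetical. It does not use ZooKeeper or Solr code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of ZooKeeper ephemeral nodes: each path is owned by one session. */
final class LiveNodeRaceDemo {
    // path -> id of the session that owns the ephemeral node
    private final Map<String, Long> ephemeralOwners = new HashMap<>();

    /** Behaviour described in this issue: remove any existing node, then create our own. */
    void registerRemovingExisting(String path, long sessionId) {
        ephemeralOwners.remove(path);
        ephemeralOwners.put(path, sessionId);
    }

    /** When a session ends, only the ephemeral nodes it owns are deleted. */
    void closeSession(long sessionId) {
        ephemeralOwners.values().removeIf(owner -> owner == sessionId);
    }

    boolean exists(String path) {
        return ephemeralOwners.containsKey(path);
    }

    /** Replays the duplicate-start sequence and reports whether the live node survives. */
    static boolean firstInstanceStillListed() {
        LiveNodeRaceDemo zk = new LiveNodeRaceDemo();
        String path = "/live_nodes/host:8983_solr"; // hypothetical node name
        zk.registerRemovingExisting(path, 1L); // first instance starts
        zk.registerRemovingExisting(path, 2L); // duplicate start replaces the entry
        zk.closeSession(2L);                   // duplicate dies with "Address already in use"
        return zk.exists(path);                // first instance is running but no longer listed
    }

    public static void main(String[] args) {
        System.out.println("first instance still listed: " + firstInstanceStillListed());
    }
}
```

In this model the final check returns false: the entry the first instance created was replaced by one owned by the second session, and that one disappears when the second session closes.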

      1. SOLR-8777.patch
        5 kB
        Shalin Shekhar Mangar
      2. SOLR-8777.patch
        3 kB
        Shalin Shekhar Mangar

        Activity

        dragonsinth Scott Blum added a comment -

        I can take a crack at this. My thought would be to attempt to bind the port earlier on, and exit more quickly. BTW, looking at just the code, it looks like the live_node isn't the only possible disruption: ZkController.init() also tries to join overseer election and publish all nodes as down.

        dragonsinth Scott Blum added a comment -

        Oh yuck, it looks like Jetty controls the ordering and doesn't bind the port until after all the servlet filters are initialized, and filter initialization is what ultimately starts up ZkController.

        dragonsinth Scott Blum added a comment -

        Not completely related, but it seems like there's a bug in Jetty's SocketConnector. It uses the ServerSocket constructor that automatically binds the port, then calls setReuseAddress(), which makes no sense because the socket is already bound by then. It should use the other constructor, set the reuse_address option, then call bind() manually.

        In other news, I don't know that there's a way to change Jetty's startup sequence. The best I could do is use reflection to pull the connectors off the Server and start them early, but that seems ungood.

        I suppose we could spin for a while waiting for the previous ephemeral node to disappear, and if it doesn't, error out and refuse to start?
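
For reference, the socket ordering suggested above (create unbound, set the option, then bind) might look roughly like this with plain java.net. This is a minimal sketch, not Jetty's code; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

final class ReuseAddressBindDemo {
    /** Opens a listening socket with SO_REUSEADDR set before the bind. */
    static ServerSocket openWithReuseAddress(int port) throws IOException {
        ServerSocket socket = new ServerSocket(); // no-arg constructor: created unbound
        try {
            socket.setReuseAddress(true);          // set the option while still unbound
            socket.bind(new InetSocketAddress(port));
            return socket;
        } catch (IOException e) {
            socket.close();                        // do not leak the socket if bind fails
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Port 0 asks the OS for any free port, so the demo does not collide with a real service.
        try (ServerSocket socket = openWithReuseAddress(0)) {
            System.out.println("bound=" + socket.isBound() + " port=" + socket.getLocalPort());
        }
    }
}
```

The ServerSocket documentation states that the effect of changing SO_REUSEADDR after the socket is bound is not defined, which is why the option is set first here.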

        shalinmangar Shalin Shekhar Mangar added a comment -

        > I suppose we could spin for a while waiting for the previous ephemeral node to disappear, and if it doesn't, error out and refuse to start?

        Yeah, spinning for say 2 * sessionTimeout should do the trick. This effectively removes this optimisation for fast node restarts and we can look into bringing it back in some form at a later date.

        > Not completely related, but it seems like there's a bug in jetty's SocketConnector. It uses the ServerSocket constructor that automatically binds the port, then attempts to set setReuseAddress(), which makes no sense. It should use the other constructor, set the reuse_address option, then call bind() manually.

        Perhaps Joakim Erdfelt or Greg Wilkins can chime in here?

        > In other news, I don't know that there's a way to change Jetty's startup sequence.. the best I could do is try to use reflection to pull the connectors off the Server and start them early. But that seems ungood.

        Theoretically, now that we control the app server, we could move to using embedded Jetty (like we do for tests with JettySolrRunner) and control the lifecycle pretty much exactly, but that is way overkill for this issue.

        shalinmangar Shalin Shekhar Mangar added a comment -

        Here's a patch which waits for up to twice the session timeout for the ephemeral node to go away before setting up overseer election and creating the live node. If the node doesn't go away, we raise an exception and bail out.
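
The attached patch is not reproduced on this page. As a rough illustration of the wait-then-bail-out idea described above, a polling loop could be sketched as follows. Everything here is an assumption for illustration: the existence check is passed in as a BooleanSupplier instead of a real ZooKeeper call, and the class name, method name, poll interval, and timeout values are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class EphemeralNodeWait {
    /**
     * Polls until nodeExists reports false or timeoutMs elapses.
     * Returns true if the node went away in time, false otherwise.
     */
    static boolean waitForNodeToDisappear(BooleanSupplier nodeExists, long timeoutMs, long pollIntervalMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (nodeExists.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // still present after the full wait: caller should refuse to start
            }
            Thread.sleep(pollIntervalMs);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        int sessionTimeoutMs = 100; // hypothetical value, kept small so the demo runs quickly
        long goneAt = System.currentTimeMillis() + 50; // simulated moment the old node expires
        boolean gone = waitForNodeToDisappear(
                () -> System.currentTimeMillis() < goneAt, 2L * sessionTimeoutMs, 10);
        if (!gone) {
            throw new IllegalStateException("A previous ephemeral live node still exists");
        }
        System.out.println("previous node expired; safe to register");
    }
}
```

A loop like this trades a little startup latency for simplicity; the watcher-based variant proposed in the next comment avoids the repeated existence checks.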

        dragonsinth Scott Blum added a comment -

        LGTM. One suggestion: it's almost as easy to make checkForExistingEphemeralNode() use a watcher instead of a loop.

        private void checkForExistingEphemeralNode() throws KeeperException, InterruptedException {
          if (zkRunOnly) {
            return;
          }
          String nodeName = getNodeName();
          String nodePath = ZkStateReader.LIVE_NODES_ZKNODE + "/" + nodeName;
        
          if (!zkClient.exists(nodePath, true)) {
            return;
          }
        
          final CountDownLatch deletedLatch = new CountDownLatch(1);
          Stat stat = zkClient.exists(nodePath, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
              if (Event.EventType.None.equals(event.getType())) {
                return;
              }
              if (Event.EventType.NodeDeleted.equals(event.getType())) {
                deletedLatch.countDown();
              }
            }
          }, true);
        
          if (stat == null) {
            // suddenly disappeared
            return;
          }
        
          boolean deleted = deletedLatch.await(zkClient.getSolrZooKeeper().getSessionTimeout() * 2, TimeUnit.MILLISECONDS);
          if (!deleted) {
            throw new SolrException(ErrorCode.SERVER_ERROR, "A previous ephemeral live node still exists. " +
                "Solr cannot continue. Please ensure that no other Solr process using the same port is running already.");
          }
        }
        
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Scott! I like your solution better so this patch uses your code. I'll commit this shortly.

        jira-bot ASF subversion and git services added a comment -

        Commit 4ea95bf8f11a9fb0b4226a0cd4b6840b845cf611 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4ea95bf ]

        SOLR-8777: Duplicate Solr process can cripple a running process

        jira-bot ASF subversion and git services added a comment -

        Commit 812fd346f7a136ccfe550a6ba0d7b0e634d68769 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=812fd34 ]

        SOLR-8777: Duplicate Solr process can cripple a running process
        (cherry picked from commit 4ea95bf)

        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Jessica and Scott!

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.


          People

          • Assignee: Shalin Shekhar Mangar (shalinmangar)
          • Reporter: Shalin Shekhar Mangar (shalinmangar)
          • Votes: 1
          • Watchers: 6
