Solr / SOLR-683

Distributed Search / Shards Deadlock

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: search
    • Labels:
      None
    • Environment:

      Linux
      jre1.6.0_05
      8GB RAM
      2 x 2 core AMD 2.4 Ghz
      2 x 140GB disk

      Description

      Per this discussion:
      http://www.nabble.com/Distributed-Search-Strategy---Shards-td18882112.html

      Solr seems to lock up when running distributed search on three servers, with all three using shards of each other. Thread dump attached.

      Attachments

      1. locked.log (180 kB) - Cameron
      2. SOLR-683.patch (0.8 kB) - Lars Kotthoff

        Activity

        Yonik Seeley added a comment -

        Here's the problem: deadlock is possible when the max number of concurrent HTTP requests is less than the number of possible HTTP requests (both from top-level clients and from other shards).

        Consider the simplest case of two shards, each with just a single thread dedicated to handling incoming HTTP requests. A top-level request comes into each shard, and each shard queries the other. The second request to each shard blocks because the first thread has not yet completed. Deadlock.
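
        A minimal, self-contained sketch of that scenario (editorial illustration, not Solr code: a single-thread ExecutorService stands in for a container with one worker thread, and the latch just forces both top-level requests to be in flight before either fans out):

        ShardDeadlockDemo.java
        import java.util.concurrent.*;

        public class ShardDeadlockDemo {
            public static void main(String[] args) throws Exception {
                // Each "shard" has exactly one request-handling thread.
                ExecutorService shardA = Executors.newSingleThreadExecutor();
                ExecutorService shardB = Executors.newSingleThreadExecutor();
                // Make sure both top-level requests occupy their shard's only thread
                // before either shard issues its sub-request to the other.
                CountDownLatch bothBusy = new CountDownLatch(2);

                Callable<String> topLevelToA = () -> {
                    bothBusy.countDown(); bothBusy.await();
                    // Shard A's only thread now blocks waiting on shard B...
                    return shardB.submit(() -> "reply from B").get();
                };
                Callable<String> topLevelToB = () -> {
                    bothBusy.countDown(); bothBusy.await();
                    // ...while shard B's only thread blocks waiting on shard A.
                    return shardA.submit(() -> "reply from A").get();
                };

                Future<String> a = shardA.submit(topLevelToA);
                Future<String> b = shardB.submit(topLevelToB);
                try {
                    a.get(5, TimeUnit.SECONDS);
                    b.get(5, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    System.out.println("Deadlocked: each shard's only thread is waiting on the other");
                }
                shardA.shutdownNow();
                shardB.shutdownNow();
            }
        }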

        Cameron added a comment -

        So this seems to be a container level issue, not a Solr issue?

        Yonik Seeley added a comment -

        I duplicated a deadlock with two shards with 1000 client threads making requests.
        When I changed the maxThreads parameter from 250 to 10000 (in jetty.xml), the deadlocks went away... I was able to run through 1M requests.
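
        For reference, the change amounts to raising the thread pool limit in jetty.xml; the relevant fragment would look roughly like the following (this assumes the org.mortbay.thread.BoundedThreadPool setup from the bundled Jetty 6 example config; treat the exact class and values as illustrative):

        jetty.xml
        <Configure id="Server" class="org.mortbay.jetty.Server">
          <Set name="ThreadPool">
            <New class="org.mortbay.thread.BoundedThreadPool">
              <Set name="minThreads">10</Set>
              <!-- SOLR-683: keep this well above the number of concurrent top-level
                   and inter-shard requests, or distributed search can deadlock -->
              <Set name="maxThreads">10000</Set>
            </New>
          </Set>
        </Configure>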

        Yonik Seeley added a comment -

        So this seems to be a container level issue, not a Solr issue?

        Yes and no... it's not a low-level solr bug, and it can be solved by upping the number of concurrent threads or http requests in the container.

        But if we could set a read timeout on shard requests, we could also prevent a hard deadlock and return an error instead. In any case, we should increase the number of threads in the example jetty config and document this issue.
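
        (Editorial illustration: assuming the inter-shard requests go over Apache Commons HttpClient 3.x, a read timeout would be a socket-level setting along these lines; the 10-second value is arbitrary, and whether and where Solr should expose it is exactly what's being discussed.)

        ShardClientTimeoutSketch.java
        import org.apache.commons.httpclient.HttpClient;
        import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

        public class ShardClientTimeoutSketch {
            public static HttpClient newShardClient() {
                // Sketch only: give shard requests a socket read timeout so a blocked
                // shard eventually fails the request instead of hanging forever.
                HttpClient client = new HttpClient(new MultiThreadedHttpConnectionManager());
                client.getHttpConnectionManager().getParams().setSoTimeout(10000); // 10s, illustrative
                return client;
            }
        }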

        Cameron added a comment -

        We could up the number of threads in our container, but this does not completely resolve the issue, as any sort of denial of service attack would potentially cause this to happen with no possible way of recovery. I would agree that some sort of timeout would be needed to actually solve the issue.

        Lars Kotthoff added a comment -

        Another way to handle this would be to configure the servlet container to reject incoming connections when all available threads are in use [1][2]. This will cause failed requests which could have been served after a short wait, but eliminates the deadlock problem.

        [1] http://docs.codehaus.org/display/JETTY/Configuring+Connectors
        [2] http://tomcat.apache.org/tomcat-6.0-doc/config/http.html

        Yonik Seeley added a comment -

        The problem with a read timeout is it would cause otherwise perfectly acceptable requests to fail, even if the system is not under load (since we can't put an upper bound on how long a request can take).

        I'm resolving this for now since I upped the max threads in the example jetty.xml to 10K and documented the issue on the distributed search wiki page.

        If the servlet container can be configured to reject requests rather than blocking, that would probably be the ideal scenario. If anyone knows if Jetty can be configured to do that, we can add it to the solr example.

        Lars Kotthoff added a comment -

        Attaching patch which adds the configuration parameter to have an accept queue size of 0 to jetty.xml, along with a reference to this issue and a boilerplate warning.
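
        For context, a connector fragment of the kind the patch targets might look like this (the SocketConnector class and port here are placeholders for illustration; the actual change is in the attached SOLR-683.patch):

        jetty.xml
        <Call name="addConnector">
          <Arg>
            <New class="org.mortbay.jetty.bio.SocketConnector">
              <Set name="port">8983</Set>
              <!-- SOLR-683: with an accept queue of 0, connections that cannot be
                   handled immediately are refused instead of queued -->
              <Set name="acceptQueueSize">0</Set>
            </New>
          </Arg>
        </Call>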

        Yonik Seeley added a comment -

        Hmmm I see this was just committed, but are we sure it works?
        Isn't acceptQueueSize just the network level connection queue size for the socket (as normally set by the listen sys call)?
        When jetty runs out of handler threads, does it not accept new connections, or does it accept the connection and wait for a thread to become free to handle it?
        If the former, then this patch should work. If the latter, it won't.

        Mark Miller added a comment -

        >> When jetty runs out of handler threads, does it not accept new connections, or does it accept the connection and wait for a thread to become free to handle it?

        Not sure if this is still the case, but I believe Jetty did just use the standard socket backlog queue and set it by default to the number of service threads - so you can have that many threadless requests queued up. Dunno if they changed that recently or not.

        - Mark
        Otis Gospodnetic added a comment -

        Hm, hard to tell from sparse Jetty javadocs.
        http://docs.codehaus.org/display/JETTY/Configuring+Connectors states:

        acceptQueueSize Number of connection requests that can be queued up before the operating system starts to send rejections.

        Sounds more like the latter than the former. That is, it sounds like Jetty itself might accept connections until the OS starts complaining. Hm, either way this doesn't help if one has an actual deadlock, like the one you described in the 1-thread-example, does it?

        Lars Kotthoff added a comment -

        Not sure if this is still the case, but I believe Jetty did just use the standard socket backlog queue

        A quick look at the code suggests that this is still the case (at least for version 6.1.3 bundled with Solr).

        When jetty runs out of handler threads, does it not accept new connections, or does it accept the connection and wait for a thread to become free to handle it?

        When it runs out of handler threads it can't accept the connection because there's no thread to handle it. The code where this is implemented looks like this.

        AbstractConnector.java
        // excerpt from the acceptor thread's run loop in Jetty 6.1.3; irrelevant parts elided
        Thread current = Thread.currentThread();
        synchronized (AbstractConnector.this)
        {
            if (_acceptorThread == null)
                return;

            _acceptorThread[_acceptor] = current;
        }
        ...
        while (isRunning())
        {
            try
            {
                accept(_acceptor);
            }
            // exception handling elided
        }

        The connection is only accepted if there's a thread to handle it.

        It's clearer in the Tomcat documentation for the equivalent parameter (acceptCount in http://tomcat.apache.org/tomcat-6.0-doc/config/http.html).

        Yonik Seeley added a comment -

        The connection is only accepted if there's a thread to handle it.

        Yes, but not from the normal pool... it looks like there are acceptor threads that do nothing but accept socket connections.

        I just confirmed that setting the acceptQueueSize does not work to reject connections.
        I put in a configurable sleep in the search handler and made requests until they started blocking. Requests were still accepted and just hung... netstat showed them to be "ESTABLISHED".

        Further, setting a really low acceptQueueSize runs the risk of having connections rejected even in a low-load situation because jetty doesn't accept them fast enough.

        Yonik Seeley added a comment -

        I just rolled back the second commit... I think just upping the thread count should be fine for now.

        Lars Kotthoff added a comment -

        Yonik, you're right: there are separate acceptor threads, and setting acceptQueueSize only affects how connections are handled when they come in too quickly for the available acceptor threads to accept them. There's no option to influence how connections are handled when no executor threads are available. I've verified that Tomcat behaves the same way.

        So the only thing we can do is up the thread count. Even setting timeouts won't help, as they only affect the actual network transfers, not the execution time of the executor threads.


          People

          • Assignee: Yonik Seeley
          • Reporter: Cameron
          • Votes: 0
          • Watchers: 3
