SOLR-9290

TCP connections in CLOSE_WAIT state spike during heavy indexing and do not decrease

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 5.3.2, 5.4.1, 5.5.1, 5.5.2, 6.0, 6.0.1, 6.1
    • Fix Version/s: 5.5.3, 6.2
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels: None

      Description

      Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT state.

      At my workplace, we have seen this issue only with 5.5.1 and could not reproduce it with 5.4.1, but from my conversation with Shalin, he knows of users running into this issue with 5.3.1 as well.

      Here's an excerpt from the email Shai Erera sent to the mailing list about what we see:

      1) It consistently reproduces on 5.5.1, but does not reproduce on 5.4.1
      2) It does not reproduce when SSL is disabled
      3) Restarting the Solr process (sometimes both need to be restarted), the
      count drops to 0, but if indexing continues, they climb up again

      When it does happen, Solr seems stuck. The leader cannot talk to the
      replica, or vice versa, the replica is usually put in DOWN state and
      there's no way to fix it besides restarting the JVM.

      Here's the mail thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3C46cc66220a8143dc903fa34e792059c4@vp-exc01.dips.local%3E

      Creating this issue so we could track this and have more people comment on what they see.

      Attachments

      1. index.sh
        0.8 kB
        Hoss Man
      2. setup-solr.sh
        1 kB
        Hoss Man
      3. setup-solr.sh
        0.6 kB
        Shalin Shekhar Mangar
      4. SOLR-9290.patch
        19 kB
        Shalin Shekhar Mangar
      5. SOLR-9290.patch
        17 kB
        Shalin Shekhar Mangar
      6. SOLR-9290.patch
        17 kB
        Shalin Shekhar Mangar
      7. SOLR-9290-debug.patch
        17 kB
        Shalin Shekhar Mangar
      8. SOLR-9290-debug.patch
        15 kB
        Shalin Shekhar Mangar

        Activity

        jsm Johannes Meyer added a comment -

        We have the same issue on Solr 6.1.0

        hossman Hoss Man added a comment -

        Questions specifically for Shai Erera, following up on comments made in the mailing list thread mentioned in the issue summary...

        When it does happen, the number of CLOSE_WAITS climb very high, to the order of 30K+ entries in 'netstat'.
        ...
        When I say it does not reproduce on 5.4.1 I really mean the numbers
        don't go as high as they do in 5.5.1. Meaning, when running without
        SSL, the number of CLOSE_WAITs is smallish, usually less than a 10 (I
        would separately like to understand why we have any in that state at
        all). When running with SSL and 5.4.1, they stay low at the order of
        hundreds the most.

        • Does this only reproduce in your application, with your customized configs of Solr, or can you reproduce it using something trivial like "modify bin/solr.in.sh to point at an SSL cert, then run: bin/solr -noprompt -cloud"?
        • Does the problem manifest solely with indexing, or with queries as well? i.e...
          • assuming a pre-built collection, and then all nodes restarted, does hammering the cluster with read only queries manifest the problem?
          • assuming a virgin cluster with no docs, does hammering the cluster w/updates but never any queries, manifest the problem?
        • Assuming you start by bringing up a virgin cluster and then begin hammering it with whatever sequence of requests is needed to manifest the problem, how long do you have to wait before the number of CLOSE_WAITS spikes high enough that you are reasonably confident the problem has occurred?

        The last question is a prerequisite to wondering whether we can just git bisect to identify where/when the problem originated.

        Even if writing a (reliable) bash automation script (to start the cluster, trigger requests, monitor the CLOSE_WAITS to see if they go over a specified threshold within a specified time limit, and shut everything down cleanly) is too cumbersome to trust for an automated git bisect run test.sh, we could still consider doing some manually driven git bisection to try and track this down, as long as each iteration doesn't take very long.

        Specifically: git merge-base says ffadf9715c4a511178183fc1411b18c1701b9f1d is the common ancestor of 5.4.1 and 5.5.1, and git log says there are 487 commits between that point and the 5.5.1 tag. A binary search of those commits should take only ~9-10 iterations (log2 of 487 is about 8.9) to find the first problematic one.

        Assuming there is a manual process someone can run on a clean git checkout of 5.4.1 that takes under 10 minutes to get from "ant clean server" to an obvious spike in CLOSE_WAITS, someone with some CPU cycles to spare who doesn't mind a lot of context switching while they do their day job could be...

        1. running a command to spin up the cluster & client hammering code
        2. setting a 10 minute timer
        3. when the timer goes off, check the results of another command to count the CLOSE_WAITS
        4. git bisect good/bad
        5. repeat

        ...and within ~2-3 hours should almost certainly have tracked down when/where the problem started.
        shalinmangar Shalin Shekhar Mangar added a comment -

        It is reproducible very easily on stock Solr with SSL enabled. My test setup creates two SSL-enabled Solr instances with a 5 shard x 2 replica collection and runs a short indexing program (just 9 update requests with 1 document each and a commit at the end). If you keep running the indexing program repeatedly, the number of connections in the CLOSE_WAIT state gradually increases.

        Interestingly, the number of connections stuck in CLOSE_WAIT decreases during indexing and increases again about 10 or so seconds after the indexing is stopped.

        I can reproduce the problem on 6.1, 6.0, 5.5.1, 5.3.2. I am not able to reproduce this on master although I don't see anything relevant that has changed since 6.1 – I tried this only once so it may have just been bad timing?

        When the connections show in CLOSE_WAIT state, the recv-q buffer always has exactly 70 bytes.

        netstat -tonp | grep CLOSE_WAIT | grep java
        tcp       70      0 127.0.0.1:56538         127.0.1.1:8983          CLOSE_WAIT  21654/java       off (0.00/0/0)
        tcp       70      0 127.0.0.1:47995         127.0.1.1:8984          CLOSE_WAIT  21654/java       off (0.00/0/0)
        tcp       70      0 127.0.0.1:47477         127.0.1.1:8984          CLOSE_WAIT  21654/java       off (0.00/0/0)
        tcp       70      0 127.0.0.1:47996         127.0.1.1:8984          CLOSE_WAIT  21654/java       off (0.00/0/0)
        tcp       70      0 127.0.0.1:56644         127.0.1.1:8983          CLOSE_WAIT  21654/java       off (0.00/0/0)
        tcp       70      0 127.0.0.1:56533         127.0.1.1:8983          CLOSE_WAIT  21654/java       off (0.00/0/0)
        ...
        

        If I run the same steps with SSL disabled then the connections in CLOSE_WAIT state have just 1 byte in recv-q. I don't see the number of such connections increasing with indexing over time but I know for a fact (from a client) that eventually more and more connections pile up in this state even without SSL.

        tcp       1      0 127.0.0.1:41723         127.0.1.1:8983          CLOSE_WAIT  2522/java        off (0.00/0/0)
        tcp       1      0 127.0.0.1:41780         127.0.1.1:8983          CLOSE_WAIT  2640/java        off (0.00/0/0)
        ...
        

        I enabled debug logging for PoolingHttpClientConnectionManager (used in 6.x) and PoolingClientConnectionManager (used in 5.x.x). After running the indexing program and verifying that some connections are in CLOSE_WAIT, I grepped the logs for connections leased vs. released, and the counts always match, which means the connections are always given back to the pool.

        Now some connections hanging around in CLOSE_WAIT are to be expected because of the following (quoted from the httpclient documentation):

        One of the major shortcomings of the classic blocking I/O model is that the network socket can react to I/O events only when blocked in an I/O operation. When a connection is released back to the manager, it can be kept alive however it is unable to monitor the status of the socket and react to any I/O events. If the connection gets closed on the server side, the client side connection is unable to detect the change in the connection state (and react appropriately by closing the socket on its end).
        HttpClient tries to mitigate the problem by testing whether the connection is 'stale', that is no longer valid because it was closed on the server side, prior to using the connection for executing an HTTP request. The stale connection check is not 100% reliable. The only feasible solution that does not involve a one thread per socket model for idle connections is a dedicated monitor thread used to evict connections that are considered expired due to a long period of inactivity. The monitor thread can periodically call ClientConnectionManager#closeExpiredConnections() method to close all expired connections and evict closed connections from the pool. It can also optionally call ClientConnectionManager#closeIdleConnections() method to close all connections that have been idle over a given period of time.

        I'm going to try adding such a monitor thread and see if this is still a problem.
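
        (For illustration, here is a minimal sketch of such a monitor thread, modeled on the pattern described in the HttpClient documentation quoted above and written against the ClientConnectionManager API that the 5.x/6.x code uses; the 10 second poll interval and 50 second idle timeout are just the values discussed later in this thread, not something any particular patch prescribes.)

        import java.util.concurrent.TimeUnit;
        import org.apache.http.conn.ClientConnectionManager;

        /** Background thread that periodically evicts expired and idle pooled connections. */
        class IdleConnectionsMonitorThread extends Thread {
            private final ClientConnectionManager connMgr;
            private volatile boolean stopped;

            IdleConnectionsMonitorThread(ClientConnectionManager connMgr) {
                this.connMgr = connMgr;
                setDaemon(true);
            }

            @Override
            public void run() {
                try {
                    while (!stopped) {
                        synchronized (this) {
                            wait(10 * 1000); // poll every 10 seconds
                        }
                        // close connections whose keep-alive has expired
                        connMgr.closeExpiredConnections();
                        // close connections that have been idle for more than 50 seconds
                        connMgr.closeIdleConnections(50, TimeUnit.SECONDS);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop if interrupted
                }
            }

            void shutdown() {
                stopped = true;
                synchronized (this) {
                    notifyAll();
                }
            }
        }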

        shalinmangar Shalin Shekhar Mangar added a comment -

        The attached setup-solr.sh is a small script that I use to set up different versions with SSL enabled and debug logging enabled for httpclient.

        The debug logging looks like the following:

        2016-07-12 13:25:05.692 DEBUG (httpShardExecutor-4-thread-7-processing-x:xyz_shard3_replica1 r:core_node5 https:////127.0.1.1:8983//solr//xyz_shard3_replica2// n:127.0.1.1:8984_solr s:shard3 c:xyz [https:////127.0.1.1:8983//solr//xyz_shard3_replica2//]) [c:xyz s:shard3 r:core_node5 x:xyz_shard3_replica1] o.a.h.i.c.PoolingClientConnectionManager Connection leased: [id: 17][route: {s}->https://127.0.1.1:8983][total kept alive: 0; route allocated: 5 of 100000; total allocated: 5 of 100000]
        ...
        2016-07-12 13:25:05.791 DEBUG (recoveryExecutor-3-thread-4-processing-n:127.0.1.1:8984_solr x:xyz_shard1_replica1 s:shard1 c:xyz r:core_node2) [c:xyz s:shard1 r:core_node2 x:xyz_shard1_replica1] o.a.h.i.c.PoolingClientConnectionManager Connection released: [id: 17][route: {s}->https://127.0.1.1:8983][total kept alive: 8; route allocated: 8 of 100000; total allocated: 8 of 100000]
        

        The attached SOLR-9290-debug.patch applies to 5.3.x and changes HttpSolrClient to log the connection details including the client port number for each request.

        mbjorgan Mads Tomasgård Bjørgan added a comment - - edited

        I performed a bisect, marking a commit from 5.4.1 as good and a commit from 5.5.3 as bad. This identified ad9b87a7285e444cd61fffb83c0aee06c8f7cef0 as the first bad commit. However, that commit only increases the limits on how many update connections can be open. I built Solr on top of the last commit from branch_5_4 (7d52c2523c7a4ff70612742b76b934a12b493331), applied the supposedly bad commit, and ended up with the same CLOSE_WAIT leak. Thus this problem affects version 5.4.1 as well - it is just harder to see because Solr isn't allowed to use that many connections when updating.

        shaie Shai Erera added a comment -

        Interestingly, the number of connections stuck in CLOSE_WAIT decrease during indexing and increase again about 10 or so seconds after the indexing is stopped.

        I've observed that too and it's not that they decrease, but rather that the connections change their state from CLOSE_WAIT to ESTABLISHED, then when indexing is done to TIME_WAIT and then finally to CLOSE_WAIT again. I believe this aligns with what the HC documentation says – the connections are not necessarily released, but kept in the pool. When you re-index again, they are reused and go back to the pool.

        However, this commit only increases the limits on how many update connections that can be open

        That's interesting and might be a temporary workaround for the problem, which I intend to test shortly. After 5.4.1 they were both raised to 100,000:

        -  public static final int DEFAULT_MAXUPDATECONNECTIONS = 10000;
        -  public static final int DEFAULT_MAXUPDATECONNECTIONSPERHOST = 100;
        +  public static final int DEFAULT_MAXUPDATECONNECTIONS = 100000;
        +  public static final int DEFAULT_MAXUPDATECONNECTIONSPERHOST = 100000;
        

        This can explain why we run into trouble with 5.5.1 but not with 5.4.1. Even in 5.4.1 there are a few hundred CLOSE_WAIT connections, but with 5.5.1 they reach (in our case) on the order of 35-40K, at which point Solr becomes useless, unable to talk to the replica or pretty much anything else.

        I see these can be defined in solr.xml, though it's not documented how, so I'm going to give it a try and will report back here.

        shaie Shai Erera added a comment -

        An update – I've modified our solr.xml (which is basically the vanilla solr.xml) with these added props (under the solrcloud element) and I do not see the connections spike anymore:

            <int name="maxUpdateConnections">10000</int>
            <int name="maxUpdateConnectionsPerHost">100</int>
        

        Those changes were part of SOLR-8533. Mark Miller, on that issue you didn't explain why the defaults need to be set that high. Was there perhaps an email thread you can link to which includes more details? I ask because one thing I've noticed is that if I query solr/admin/info/system, the system.openFileDescriptorCount is very high when there are many CLOSE_WAITs. Such a change in Solr's defaults probably needs to be accompanied by an OS-level setting too, no?

        I am still running tests with those props set in solr.xml, on top of 5.5.1. Mads Tomasgård Bjørgan would you mind testing in your environment too?

        Chris Hostetter (Unused), sorry I completely missed your questions. Our solr.xml is the vanilla one, we didn't modify anything in it. We did uncomment the SSL props in solr.in.sh as the ref guide says, but aside from the key name and password, we didn't change any settings.

        markrmiller@gmail.com Mark Miller added a comment -

        The defaults need to be very high to avoid distributed deadlock.

        shaie Shai Erera added a comment -

        Thanks Mark Miller. In that case, what's your take on the issue at hand?

        shaie Shai Erera added a comment -

        Also Mark Miller, for education purposes, if you have a link to a discussion about why it may lead to a distributed deadlock, I'd be happy to read it.

        yseeley@gmail.com Yonik Seeley added a comment -

        if you have a link to a discussion about why it may lead to a distributed deadlock, I'd be happy to read it.

        SOLR-683

        Same logic applies to any internal general purpose thread pools or connection pools / connection limits. Think of acquiring a thread like acquiring a lock. If there are going to be a limited number of resources, then one needs to be very careful under what circumstances those resources can be acquired.

        shaie Shai Erera added a comment -

        Thanks Yonik Seeley, I'll read the issue.

        I agree with what you write in general, but we do hit an issue with these settings. The fact that it reproduces easily with SSL enabled suggests that the issue may not be in Solr code at all, but I wonder if we shouldn't perhaps pick smaller default values when SSL is enabled? (Our guess at the moment is that HC keeps more connections in the pool when SSL is enabled because they are more expensive to initiate, but it's just a guess.)

        And maybe the proper solution would be what Shalin Shekhar Mangar wrote above – have a bg monitor which closes idle/expired connections. I actually wonder why it can't be a property of ClientConnectionManager that you can set to auto close idle/expired connections after a period of time. We can potentially have that monitor act only if SSL is enabled (or at least until non-SSL exhibits the same problems too).

        anshumg Anshum Gupta added a comment -

        Shalin Shekhar Mangar mentioned that he's able to reproduce this in 5.3.2 as well, which was without SOLR-8533 so we certainly need to look at this more.

        Shalin, can you confirm if you were running your tests in stock Solr?

        shalinmangar Shalin Shekhar Mangar added a comment - - edited

        This patch applies on 5.3.2 and adds a monitor thread for the pool created in UpdateShardHandler; with it applied, I cannot reproduce this problem anymore. There are still a few connections in CLOSE_WAIT at steady state but I verified that they belong to a different HttpClient instance in HttpShardHandlerFactory and other places.

        My hypothesis: we have a large limit for maxConnections and maxConnectionsPerHost. As long as the limit isn't hit and the servers are decently busy, new connections will keep being created by the pool. In 5.x and 6.x we have no policy of closing idle connections, so httpclient keeps these connections around in CLOSE_WAIT for reuse. Therefore we must periodically close such connections once they're idle, to keep their number from growing to absurd levels.

        Also, I think the reason this wasn't reproducible on master is because SOLR-4509 enabled eviction of idle connections by calling HttpClientBuilder#evictIdleConnections with a 50 second limit.
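
        (For reference, a minimal sketch of what enabling that eviction through the builder looks like, assuming the HttpClient 4.5.x HttpClientBuilder API; this only illustrates the call mentioned above and is not the actual SOLR-4509 code.)

        import java.util.concurrent.TimeUnit;
        import org.apache.http.impl.client.CloseableHttpClient;
        import org.apache.http.impl.client.HttpClients;

        public class EvictionExample {
            public static void main(String[] args) throws Exception {
                // Build a client whose pooled connections are cleaned up by a background evictor thread.
                try (CloseableHttpClient client = HttpClients.custom()
                        .evictExpiredConnections()                  // drop connections past their keep-alive
                        .evictIdleConnections(50, TimeUnit.SECONDS) // drop connections idle for 50+ seconds
                        .build()) {
                    // ... use the client ...
                }
            }
        }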

        shalinmangar Shalin Shekhar Mangar added a comment -

        Shalin Shekhar Mangar mentioned that he's able to reproduce this in 5.3.2 as well, which was without SOLR-8533 so we certainly need to look at this more.

        Shalin, can you confirm if you were running your tests in stock Solr ?

        Actually it is 5.3.2 with some kerberos patches but the client which originally reported the issue was using stock 5.3.2. I don't think the changes are relevant.

        I believe this was a problem all along. It just got amplified with SOLR-8533 in 5.5.x because now the limit is higher.

        shaie Shai Erera added a comment -

        Thanks Shalin Shekhar Mangar. A few questions:

        Also, I think the reason this wasn't reproducible on master is because SOLR-4509 enabled eviction of idle threads by calling HttpClientBuilder#evictIdleConnections with a 50 second limit.

        Is this something we can apply to 5x/6x too?

        This patch adds a monitor thread for the pool created in UpdateShardHandler and with this applied

        I didn't see the monitor in the latest patch, only the log printouts. Did you forget to add it?

        There are still a few connections in CLOSE_WAIT at steady state but I verified that they belong to a different HttpClient instance in HttpShardHandlerFactory and other places.

        (1) Can/Should we have a similar monitor for HttpShardHandlerFactory?
        (2) Any reason why the two don't share the same HttpClient instance?

        This patch applies on 5.3.2

        We have a large limit for maxConnections and maxConnectionsPerHost

        I thought that hypothesis holds only after SOLR-8533. Are you saying you also saw it on 5.3.2? If so, what are the values that are set for these properties there? We definitely do not see the problem with 5.4.1, but we didn't test prior versions.

        shaie Shai Erera added a comment -

        I thought that hypothesis holds only after SOLR-8533. Are you saying you also saw it on 5.3.2? If so, what are the values that are set for these properties there? We definitely do not see the problem with 5.4.1, but we didn't test prior versions.

        We posted at the same time; I read your answer above. I wonder why we don't see the problem with 5.4.1. I mean, we do see CLOSE_WAITs piling up, but they stop at ~100 (200 for the leader).

        shalinmangar Shalin Shekhar Mangar added a comment -

        I didn't see the monitor in the latest patch, only the log printouts. Did you forget to add it?

        Sorry Shai Erera, I noticed that after uploading. I have uploaded the right patch now. Please review.

        (1) Can/Should we have a similar monitor for HttpShardHandlerFactory?

        I think so. This patch was only for my tests.

        Any reason why the two don't share the same HttpClient instance?

        Hmm. I think originally the idea was to keep the pools for indexing and querying separate but now that the limit (for updates) is so high, I wonder if it still makes sense. I mean, yes you can deadlock a distributed search because of high indexing and vice-versa if you share the pool but if you ever reach the high limit of 100,000 connections, you have more serious problems in the cluster anyway.

        I wonder why we don't see the problem with 5.4.1. I mean, we do see CLOSE_WAITs piling, but stop at ~100 (200 for the leader)

        Do you have only two replicas? Perhaps the maxConnectionsPerHost limit of 100 is kicking in?

        shaie Shai Erera added a comment -

        Do you have only two replicas? Perhaps the maxConnectionsPerHost limit of 100 is kicking in?

        Yes, we do have only 2 replicas and I get why the CLOSE_WAITs stop at 100. I was asking about 5.3.2 – how could CLOSE_WAITs get high in 5.3.2 when maxConnectionsPerHost was the same as in 5.4.1?

        shaie Shai Erera added a comment -

        Regarding the patch, the monitor looks good. A few comments:

        • I prefer that we name it IdleConnectionsMonitor (w/ 's', plural connections). It goes for the class, field and thread name.
        • Do you intend to keep all the log statements around?
        • Do you think we should make the polling interval (10s) and idle-connections-time (50s) configurable? Perhaps through system properties?
        shalinmangar Shalin Shekhar Mangar added a comment -

        I was asking about 5.3.2 – how could CLOSE_WAITs get high in 5.3.2 when maxConnectionsPerHost was the same as in 5.4.1?

        5.3.2 has maxConnectionsPerHost=100 for updates and maxConnectionsPerHost=20 for queries. So on a leader you may have 100*replicationFactor+20*numShards*replicationFactor connections. For a large cluster with many shards and replicas, the overall number of such connections can be quite high.
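
        (To make that formula concrete with hypothetical numbers: for a collection with 5 shards and replicationFactor=2, a leader could accumulate up to 100*2 + 20*5*2 = 400 such connections; with 10 shards and replicationFactor=3 it is already 100*3 + 20*10*3 = 900.)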

        shaie Shai Erera added a comment -

        Oh I see. So we didn't experience the problem because we run w/ 2 replicas (and one shard currently) and with 5.4.1's settings the math for us results in a low number of connections. But someone running a larger Solr deployment could already hit that problem prior to 5.5. Thanks for the clarification!

        davidabradley David A. Bradley added a comment -

        I don't understand why the preferred approach here is to just have a thread that is trying to close connections. Is the problem that these connections would never otherwise be closed? If that is the case, why can't we solve the problem of them not being closed as part of their normal usage? It sounds like master doesn't have this problem because of different client settings?
        "Also, I think the reason this wasn't reproducible on master is because SOLR-4509 enabled eviction of idle connections by calling HttpClientBuilder#evictIdleConnections with a 50 second limit."

        Why not backport that and avoid the problem entirely? Is it a different client version in master or something that makes it not that easy?

        "So we must periodically close such connections once they're idle to avoid the number of such connections increasing to absurd limits." It seems from the discussion here that the problem is hitting a high number of connections, which is only allowed to be so high because we asked for it. What if this thread lags behind enough that the connections get too high? It sounds like the purpose of this thread is to race to prevent Solr from doing what we asked it to do.

        The idea of having a thread to deal with any connections that end up in a bad state unexpectedly makes sense, but is the cause of all these CLOSE_WAIT connections really from unexpected behavior?

        I feel like I must be missing something.

        slindner Scott Lindner added a comment -

        I would like to add something too. The problem must stem from some sort of OS-level setting. In our environment I've noticed that when a given IP+port combination reaches ~28k connections in the CLOSE_WAIT state, the OS itself cannot allow any more connections to that IP+port combination (even curl fails against that combination, while other combinations, including other ports on that same host, work just fine). I mention this because the problem seems related to whatever settings we configure Solr to use, and you really must change these things in combination for it to ultimately make sense, or you risk hitting this problem at some point - though admittedly with the bg thread it wouldn't be permanent like it is for us today.

        hossman Hoss Man added a comment -

        I'm no expert but...

        I don't understand why the preferred approach here is to just have a thread that is trying to close connections. Is the problem that these connections would never otherwise be closed? ...

        ...my understanding is yes: In a situation where indexing load spikes up, you can get a lot of connections which are never completely closed. (even if they are never needed anymore)

        ...If that is the case, why can't we solve the problem of them not being closed as a part of their normal usage? ...

        again, IIUC: because they are pooled connections maintained by the HTTP layer. Per the docs shalin quoted, clients are required to call ClientConnectionManager#closeExpiredConnections() if they (ie: "we") want to ensure those connections get closed properly.

        It sounds like master doesn't have this problem because of different client settings? ... Why not backport that and avoid the problem entirely? Is it a different client version in master or something that makes it not that easy?

        master & branch_6x (and earlier) use completely different HTTP client APIs (see SOLR-4509) ... the HttpClientBuilder.evictIdleConnections method shalin referred to being used on master is on a class (HttpClientBuilder) that is not used at all in branch_6x.

        The docs of that method describe it doing virtually the exact same thing on the (private connection pool for the) HttpClient as what Shalin's patch does (on the pool in the shared ClientConnectionManager) ...

        Makes this instance of HttpClient proactively evict idle connections from the connection pool using a background thread. 
        

        Which makes me wonder...

        Shalin Shekhar Mangar: why not just re-use the IdleConnectionEvictor class provided by httpcomponents (getting the exact same underlying impl as what master gets from HttpClientBuilder.evictIdleConnections) ?

        https://hc.apache.org/httpcomponents-client-4.4.x/httpclient/apidocs/org/apache/http/impl/client/IdleConnectionEvictor.html
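
        (For reference, a minimal sketch of how that class is typically wired up, assuming the HttpClient 4.5.x HttpClientConnectionManager API; as noted in the reply below, the 5.x/6.x code actually uses the older deprecated connection manager, so this is illustration only.)

        import java.util.concurrent.TimeUnit;
        import org.apache.http.impl.client.IdleConnectionEvictor;
        import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

        public class EvictorWiring {
            public static void main(String[] args) throws InterruptedException {
                PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
                // Background thread shipped with httpclient that closes expired connections
                // and connections that have been idle for more than 50 seconds.
                IdleConnectionEvictor evictor = new IdleConnectionEvictor(cm, 50, TimeUnit.SECONDS);
                evictor.start();
                // ... build an HttpClient on top of 'cm' and use it ...
                evictor.shutdown();
                evictor.awaitTermination(1, TimeUnit.MINUTES);
            }
        }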

        hossman Hoss Man added a comment -

        Somebody sanity check my understanding / summary description of the root issue...

        • Solr's use of HttpClient for intra-node communication has historically always had the potential to result in connections sitting "idle" (ie: in a CLOSE_WAIT state) for possible re-use later – but these connections are kept open indefinitely.
          • For reasons I don't understand, 'idle' connections are more likely to (exist? | be kept around indefinitely?) when the intra-node communication is over SSL.
        • maxUpdateConnections and maxUpdateConnectionsPerHost have always set hard upper limits on the number of connections that could ever be created – let alone sit idle in a CLOSE_WAIT state.
        • Prior to SOLR-8533, the default values for these limits were relatively low, making it unlikely that users could ever observe an extreme # of idle / CLOSE_WAIT connections – you were more likely to have your Solr cluster crash from deadlocks than notice any serious OS-level problem with too many idle connections
        • After SOLR-8533, the increased default values of these limits made the problem much more noticeable
        • SOLR-4509's changes included use of a new option which results in a background thread checking for existing idle connections on the master branch
        • This issue addresses the problem for branch_6x (and older) branches via a similar background thread
        hossman Hoss Man added a comment -

        FWIW, I'm attaching a beefed up setup-solr.sh and an index.sh i've been testing with...

        • setup-solr.sh
          • you must edit 2 variables: the path to your lucene checkout & an absolute path to the SSL keystore for jetty to use (with the password "secret")
          • spins up a 3 node cluster, then creates a collection with 5 shards and a replication factor of 3
        • index.sh
          • you must edit one variable to point at the SSL pem file for curl to use
          • loops forever doing a bunch of curl connections indexing the same 9 docs over and over, periodically committing & sleeping, reporting the # of CLOSE_WAIT java connections at each step

        On master, index.sh never reports any CLOSE_WAIT connections for me.

        On branch_6x, I'll see the CLOSE_WAITS spike up to 40 - even with this (essentially) single threaded indexing - and stay at steady state even after killing the index.sh process.

        On branch_6x, with shalin's patch, CLOSE_WAITS start at 15 (which is suspiciously 5x3) as soon as the collection is created – even w/o indexing – and stay at a steady 15 forever.

        Which begs the question: why are there 15 CLOSE_WAIT connections that last forever on branch_6x even with this patch?

        shaie Shai Erera added a comment -

        Which begs the question: why are there 15 CLOSE_WAIT connections that last forever on branch_6x even with this patch?

        I think Shalin's patch only adds this monitor thread to UpdateShardHandler, but not to HttpShardHandlerFactory, so these 15 could be coming from it?

        shalinmangar Shalin Shekhar Mangar added a comment -

        Hoss has covered most of it, but I have a few comments (note that I'm responding to multiple people and comments here):

        Why not backport that and avoid the problem entirely? Is it a different client version in master or something that makes it not that easy?

        We could backport SOLR-4509 to 6.x and deal with the incompatible changes but I'd certainly not backport it to 5x because it is just a huge change and I am not comfortable releasing that in a minor bug-fix release. I am sure many people running 5.x releases would also like a fix to this issue. Adding an idle eviction thread is trivial and unlikely to cause any regressions.

        Shalin Shekhar Mangar: why not just re-use the IdleConnectionEvictor class provided by httpcomponents (getting the exact same underlying impl as what master gets from HttpClientBuilder.evictIdleConnections) ?
        https://hc.apache.org/httpcomponents-client-4.4.x/httpclient/apidocs/org/apache/http/impl/client/IdleConnectionEvictor.html

        I wasn't aware of this class. But looking deeper, I see that it requires an HttpClientConnectionManager instance, whereas the 6.x and 5.x code uses the deprecated PoolingClientConnectionManager, which extends ClientConnectionManager. Now that we know it exists, I can just borrow it from the httpclient project instead of writing my own evictor. It is ASLv2 anyway.
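
        For illustration, here is a minimal sketch of what such a borrowed / hand-rolled evictor could look like against the deprecated ClientConnectionManager API still used on branch_5x and branch_6x. The class name and timing parameters are made up for the example; this is not the actual SOLR-9290 patch:

        import java.util.concurrent.TimeUnit;
        import org.apache.http.conn.ClientConnectionManager;

        // Hypothetical evictor thread (sketch only, not the SOLR-9290 patch).
        public class IdleConnectionSweeper extends Thread {
          private final ClientConnectionManager connManager;
          private final long sleepMillis;
          private final long maxIdleMillis;
          private volatile boolean stopped;

          public IdleConnectionSweeper(ClientConnectionManager connManager,
                                       long sleepMillis, long maxIdleMillis) {
            super("IdleConnectionSweeper");
            setDaemon(true);
            this.connManager = connManager;
            this.sleepMillis = sleepMillis;
            this.maxIdleMillis = maxIdleMillis;
          }

          @Override
          public void run() {
            while (!stopped) {
              try {
                Thread.sleep(sleepMillis);
                // close connections whose keep-alive TTL has expired
                connManager.closeExpiredConnections();
                // close connections that have sat unused too long; this is what
                // reaps sockets the server has already half-closed (CLOSE_WAIT)
                connManager.closeIdleConnections(maxIdleMillis, TimeUnit.MILLISECONDS);
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
              }
            }
          }

          public void shutdown() {
            stopped = true;
            interrupt();
          }
        }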

        Somebody sanity check my understanding / summary description of the root issue...

        That sounds about right to me Hoss. Thanks for the summary!

        For reasons I don't understand, 'idle' connections are more likely to (exist? | be kept around indefinitely?) when the intra-node communication is over SSL.

        Perhaps the SSL setup/teardown overhead adds some latency such that concurrent requests end up opening more connections overall? I am just guessing here.

        Which begs the question: why are there 15 CLOSE_WAIT connections that last forever on branch_6x even with this patch?

        As Shai said, this is likely the HttpShardHandler's pool. The overseer collection processor invokes a core admin create for each replica in parallel, so you get 15 connections for the 15 replicas (5 shards x 3 replicas) that were created by the collection API.

        I'm working on a new patch that applies to branch_6x and incorporates Shai's comments as well. We can then backport it to 5x.

        yseeley@gmail.com Yonik Seeley added a comment -

        I haven't been following this issue, but this caught my eye:

        Solr's use of HttpClient for intra-node communication has historically always had the potential to result in connections sitting "idle" (ie: in a CLOSE_WAIT state) for possible re-use later

        It's been a long time since I messed around with making sure Solr worked with persistent connections (we're talking CNET days... 2004, 2005).
        But CLOSE_WAIT is when one side has closed the connection... there's no going back to ESTABLISHED from that state (i.e. no reusing that connection).

        shalinmangar Shalin Shekhar Mangar added a comment -

        Hmm, you're right, Yonik. But we've always had an idle timeout of 50 seconds on the http connector in jetty (I traced this back to SOLR-128). So after 50 seconds of inactivity, Jetty closes the connection from its end and the client's socket goes into the CLOSE_WAIT state. As you said, such a connection cannot be re-used anymore. When httpclient tries to use it, the stale check detects that the other side has closed, terminates the connection, and gives a new one to Solr.

        So all the connections that suddenly stopped showing up in CLOSE_WAIT, which we had assumed went back to the ESTABLISHED state, were actually terminated.

        So in summary, our assumption that connections in CLOSE_WAIT are kept around because of re-use is wrong, but that doesn't change the solution I've proposed. We could also consider increasing Jetty's idle timeout as a separate change, but the idle eviction thread would still be necessary.

        markrmiller@gmail.com Mark Miller added a comment -

        why not just re-use the IdleConnectionEvictor class provided by httpcomponents

        I've gone down this road. It's not a great solution. This is why we ended up changing to the new APIs instead in SOLR-4509. Just having an evictor thread is not enough - you also want the ability to check connections before use if they have been sitting in the pool too long, and that requires HttpClient changes they made in the new APIs.
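
        For comparison, a rough sketch of the newer-API approach Mark is referring to (roughly what master gets via SOLR-4509): HttpClientBuilder's built-in idle-connection evictor plus a validate-after-inactivity check on the non-deprecated pooling manager. The timeout values here are illustrative, not what Solr actually configures:

        import java.util.concurrent.TimeUnit;
        import org.apache.http.impl.client.CloseableHttpClient;
        import org.apache.http.impl.client.HttpClientBuilder;
        import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

        public class NewHttpClientApiSketch {
          public static void main(String[] args) throws Exception {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            // re-validate a pooled connection before handing it out if it has been
            // idle for more than 2 seconds (the "check before use" part)
            cm.setValidateAfterInactivity(2000);

            CloseableHttpClient client = HttpClientBuilder.create()
                .setConnectionManager(cm)
                // background thread that closes connections idle for more than 40 seconds
                .evictIdleConnections(40, TimeUnit.SECONDS)
                .build();

            // ... issue requests with 'client' ...

            client.close(); // also shuts the connection manager down
          }
        }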

        shalinmangar Shalin Shekhar Mangar added a comment -

        Mark Miller – But you were trying to remove the stale check and disable Nagle's algorithm as well, which exposed you to NoHttpResponseExceptions. We aren't trying to do that here. We just want to close the idle connections so that they don't keep accumulating in the CLOSE_WAIT state.

        markrmiller@gmail.com Mark Miller added a comment -

        Okay, I didn't catch you were not removing the stale check.

        For reasons I don't understand, 'idle' connections are more likely to (exist? | be kept around indefinitely?) when the intra-node communication is over SSL.

        I think I remember reading that SSL handles the connection lifecycle differently, based on the SSL spec.

        shalinmangar Shalin Shekhar Mangar added a comment - - edited

        Patch which starts the idle connection evictor for both UpdateShardHandler and HttpShardHandlerFactory. I changed HttpShardHandlerFactory to use an external connection pool, like UpdateShardHandler does. By default the evictor wakes up every 5 seconds and closes expired connections as well as connections that have been idle for more than 40 seconds. I chose 40 seconds because it is slightly lower than the Jetty idle timeout of 50 seconds. Both of these values are configurable, for updates and for queries.

        This patch applies to branch_6x.
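
        To make the defaults above concrete, here is a minimal wiring sketch that reuses the hypothetical IdleConnectionSweeper from the earlier sketch together with the deprecated PoolingClientConnectionManager still used on branch_5x/branch_6x. It is illustrative only, not the patch itself:

        import java.util.concurrent.TimeUnit;
        import org.apache.http.impl.conn.PoolingClientConnectionManager;

        public class EvictorWiringSketch {
          public static void main(String[] args) {
            // the deprecated pool used by UpdateShardHandler / HttpShardHandlerFactory on 5x/6x
            PoolingClientConnectionManager pool = new PoolingClientConnectionManager();

            // defaults described above: wake up every 5 seconds, close connections that have
            // been idle for more than 40 seconds (just under Jetty's 50 second idle timeout)
            IdleConnectionSweeper sweeper = new IdleConnectionSweeper(
                pool, TimeUnit.SECONDS.toMillis(5), TimeUnit.SECONDS.toMillis(40));
            sweeper.start();

            // ... build an HttpClient on top of 'pool' and serve update/shard requests ...

            // on shutdown: stop the evictor first, close the client (not shown), then the pool
            sweeper.shutdown();
            pool.shutdown();
          }
        }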

        shalinmangar Shalin Shekhar Mangar added a comment - - edited

        Patch updated to fix two test failures: TestCoreContainer and OverseerTest.

        • TestCoreContainer.testCustomHandlers was calling CoreContainer.load twice, leading to leaked threads.
        • OverseerTest doesn't call init on HttpShardHandler, causing an NPE on close.
        • HttpShardHandlerFactory closed the pool before closing the http client itself. It is fixed to be the other way around.

        I think this is ready.

        markrmiller@gmail.com Mark Miller added a comment -

        Patch looks okay to me.

        +        clientConnectionManager.shutdown();
        +        IOUtils.closeQuietly(defaultClient);
        

        Not that it likely matters, but I'd reverse this and shut down the pool after the client.
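
        That is, roughly (reusing the names from the quoted diff; this is not an actual patch hunk):

        IOUtils.closeQuietly(defaultClient);
        clientConnectionManager.shutdown();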

        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks for reviewing, Mark, but I already fixed that in the last patch.

        I found a test failure in ZkControllerTest because of a thread leak so I may post another patch soon.

        shalinmangar Shalin Shekhar Mangar added a comment -

        Patch which fixes the ZkControllerTest failure. Thanks to Varun Thacker for spotting the fix.

        I'll run precommit + tests again and then commit this patch to 6x and backport to 5x.

        jira-bot ASF subversion and git services added a comment -

        Commit bb7742ebc7f33f5c9f41cc3ad28b30c20a19a380 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bb7742e ]

        SOLR-9290: TCP-connections in CLOSE_WAIT spike during heavy indexing and do not decrease

        jira-bot ASF subversion and git services added a comment -

        Commit d00c44de2eab6d01fb1df39a17b17fb769a0f541 in lucene-solr's branch refs/heads/branch_5x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d00c44d ]

        SOLR-9290: TCP-connections in CLOSE_WAIT spike during heavy indexing and do not decrease
        (cherry picked from commit bb7742e)

        jira-bot ASF subversion and git services added a comment -

        Commit 00ad5efac95f38cb1df9ef33672f17a7167a656f in lucene-solr's branch refs/heads/branch_5x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=00ad5ef ]

        SOLR-9290: Adding 5.5.3 section and this issue to CHANGES.txt

        jira-bot ASF subversion and git services added a comment -

        Commit e16fb5aa3073021993595acc061cc62bd575adc2 in lucene-solr's branch refs/heads/branch_5_5 from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e16fb5a ]

        SOLR-9290: TCP-connections in CLOSE_WAIT spike during heavy indexing and do not decrease
        (cherry picked from commit bb7742e)

        (cherry picked from commit d00c44de2eab6d01fb1df39a17b17fb769a0f541)

        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks everyone for the help!

        shalinmangar Shalin Shekhar Mangar added a comment -

        Hoss pointed out to me privately that the test fixes for ZkController, TestCoreContainer and OverseerTest should be applied to master as well.

        jira-bot ASF subversion and git services added a comment -

        Commit 833c8ee152fc28b7ec767d0e8f8ecd346229d443 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=833c8ee ]

        SOLR-9290: MockCoreContainer should call super.shutdown()

        shalinmangar Shalin Shekhar Mangar added a comment -

        It looks like all the test fixes that we made here were already fixed by SOLR-4509 on master. I only had to add a super.shutdown() call in MockCoreContainer just for the sake of correctness.

        elyograg Shawn Heisey added a comment -

        I recently joined the jetty-users mailing list for other Solr-related issues.

        Ten days ago (July 7th in my timezone) somebody sent a message to that list about encountering a large number of CLOSE_WAIT connections when using SSL.

        We have an ensemble of three jetty-servers running jetty 9.3.8 on CentOS. There is a fairly high rate of communication between the servers. When we run the ensemble without SSL, everything works perfectly, but once SSL is activated, exactly one of the servers start to get a massive amount of connections in CLOSE_WAIT (more than 50 000). This, again, causes the Old Gen-part of the heap memory in the JVM to fill up, and the server becomes unable to communicate with exactly one of the other two, with Couldn’t resolve address. However, the other machine can still communicate with the one that breaks down, and the 3rd machine can communicate with both.
        It seems that other people using Jetty (not just Solr) are experiencing a buildup of CLOSE_WAIT connections when using SSL.

        I'm going to mention this issue on the mailing list thread, and try to find out whether they are having the problem with software other than Solr.

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.


          People

          • Assignee:
            shalinmangar Shalin Shekhar Mangar
            Reporter:
            anshumg Anshum Gupta
          • Votes:
            0
            Watchers:
            21