Solr / SOLR-3280

Too many / sometimes stale CLOSE_WAIT connections from SnapPuller during / after replication

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.5, 3.6, 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Too many CLOSE_WAIT connections, some of them stale, are sometimes left over on the SLAVE server during or after replication.
      Normally GC should clean these up, but that does not always happen.
      In addition, while a CLOSE_WAIT connection is hanging, the newly replicated index won't load.

      The dirty workaround so far is to fake a TCP connection as root to that connection and close it.
      After that, the new replication loads, the old index and searcher are released, and the system
      returns to normal operation.

      Background:
      The SnapPuller uses Apache httpclient 3.x with the MultiThreadedHttpConnectionManager.
      After use, the manager holds a connection in CLOSE_WAIT so it can serve further requests.
      This is done by calling releaseConnection. But if a connection is stuck, it is no longer available and a new
      connection from the pool is used.

      Solution:
      After calling releaseConnection, clean up with closeIdleConnections(0).
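
The idea behind the proposed fix can be illustrated with a small stand-alone model. This is only a sketch and not httpclient or SnapPuller code: the classes `Pool` and `Conn` and the method `closeIdle` are hypothetical stand-ins for the connection manager, a pooled connection, and closeIdleConnections. It assumes that an idle timeout of 0 means "close everything that is currently idle".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of a keep-alive connection pool. NOT commons-httpclient code. */
public class IdlePoolSketch {

    /** Hypothetical stand-in for a pooled HTTP connection. */
    static final class Conn {
        boolean open = true;
        long idleSinceMillis; // set when the connection is handed back to the pool
    }

    /** Hypothetical stand-in for the connection manager. */
    static final class Pool {
        private final List<Conn> idle = new ArrayList<>();

        /** Reuse an idle connection if one exists, otherwise create a new one. */
        Conn acquire() {
            return idle.isEmpty() ? new Conn() : idle.remove(idle.size() - 1);
        }

        /** Models releaseConnection: the connection stays open and waits for reuse. */
        void release(Conn c, long nowMillis) {
            c.idleSinceMillis = nowMillis;
            idle.add(c);
        }

        /** Models closeIdleConnections: close connections idle for at least the timeout. */
        int closeIdle(long idleTimeoutMillis, long nowMillis) {
            int closed = 0;
            for (Iterator<Conn> it = idle.iterator(); it.hasNext();) {
                Conn c = it.next();
                if (nowMillis - c.idleSinceMillis >= idleTimeoutMillis) {
                    c.open = false;
                    it.remove();
                    closed++;
                }
            }
            return closed;
        }

        int idleCount() {
            return idle.size();
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        Conn c = pool.acquire();

        // Release only: the connection lingers in the pool, still open.
        pool.release(c, 1000L);
        System.out.println("idle after release: " + pool.idleCount());

        // Release followed by a cleanup with timeout 0: nothing is left behind.
        int closed = pool.closeIdle(0L, 1000L);
        System.out.println("closed: " + closed + ", idle now: " + pool.idleCount());
    }
}
```

The trade-off in this model is that a connection closed right after release cannot be reused by the next request, so every replication cycle opens a fresh connection.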

      1. SOLR-3280.patch
        0.8 kB
        Bernd Fehling

        Activity

        Bernd Fehling added a comment -

        This patch will fix the CLOSE_WAIT issue.

        Sami Siren added a comment -

        I did some testing around replication: 2 nodes on the same LAN, heavy replication and heavy indexing, and I did not see any sockets in CLOSE_WAIT state after running it for about 1 hour.

        Perhaps you have a firewall between master and slave that wrongly drops "idle" connections?

        Bernd Fehling added a comment -

        Nope, no firewall. I have 1 master and 2 slaves on the same LAN. After replication has finished, the connection on the master is closed and the connection on the slave is in CLOSE_WAIT with a Receive-Queue of 1 byte. If everything goes well the connection will be reused by MultiThreadedHttpConnectionManager, but if something goes wrong (which is very seldom on my systems) the connection will hang on CLOSE_WAIT and the new index is not swapped in.
        If you use jvisualvm on that slave and go to the MBeans tab you can see "solr/" in the tree but you can't open it because there is no sub-tree.
        The patch is just releasing the connection, if it hangs or not, and keeps everything operational. So no harm or performance impact for replication.

        Sami Siren added a comment -

        "but if something goes wrong (which is very seldom on my systems) the connection will hang on CLOSE_WAIT and the new index is not swapped in."

        Do you have any idea what this something is? Anything in the logs?

        "The patch is just releasing the connection, if it hangs or not, and keeps everything operational. So no harm or performance impact for replication."

        Yeah, I agree. The performance impact is minimal.

        Bernd Fehling added a comment -

        Sorry, I can't narrow it down any further: a "network hiccup", or the computing center reconfiguring something on the network. I don't know. There is nothing in the solr logs, it just hangs. The old index is still at work and serving the requests.
        I found this through the server sys logs because the space used by the index in the data directory had doubled for longer than 1 day. One slave had this in August and October last year (solr 3.3), the other slave in October (solr 3.3) and January this year (solr 3.5). After seeing the CLOSE_WAIT with netstat and forcing it to close, the system went back to normal operation: it started a new searcher with the new index, closed the old searcher and deleted the old index.
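
For anyone who wants to check a slave the same way, the netstat lookup could be scripted roughly as follows. This is a sketch under assumptions: it expects the common Linux `netstat -tan` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and the class and method names as well as the sample lines are made up for illustration, not taken from a real system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Filters netstat-style output for TCP sockets in CLOSE_WAIT (illustrative only). */
public class CloseWaitFilter {

    /** Returns the trimmed lines whose State column is CLOSE_WAIT. */
    static List<String> closeWaitLines(List<String> netstatLines) {
        List<String> hits = new ArrayList<>();
        for (String line : netstatLines) {
            String[] cols = line.trim().split("\\s+");
            // Assumed columns: Proto Recv-Q Send-Q Local Foreign State
            if (cols.length >= 6 && cols[0].startsWith("tcp") && "CLOSE_WAIT".equals(cols[5])) {
                hits.add(line.trim());
            }
        }
        return hits;
    }

    /** Reads the Recv-Q column (second column) of a matching line. */
    static int recvQueue(String line) {
        return Integer.parseInt(line.trim().split("\\s+")[1]);
    }

    public static void main(String[] args) {
        // Made-up sample input; addresses are from the documentation range.
        List<String> sample = Arrays.asList(
            "Proto Recv-Q Send-Q Local Address           Foreign Address         State",
            "tcp        1      0 192.0.2.20:45678        192.0.2.10:8983         CLOSE_WAIT",
            "tcp        0      0 192.0.2.20:22           192.0.2.30:50122        ESTABLISHED");

        for (String hit : closeWaitLines(sample)) {
            System.out.println("Recv-Q=" + recvQueue(hit) + "  " + hit);
        }
    }
}
```

A single CLOSE_WAIT entry is not a problem in itself, since the connection may simply be waiting for reuse. An entry that survives several replication intervals while the data directory stays at double size is the pattern described above.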

        David Fu added a comment -

        I am facing the same problem. Any update on when this will be resolved?

        Bernd Fehling added a comment -

        After going from solr 3.6 to 4.2.1 I haven't seen this anymore. SnapPuller was reworked quite a bit for multicore. Which version are you using?

        David Fu added a comment -

        I am still on 3.4. I noticed that solr4 pretty much reimplemented the snappuller, and I am thinking about upgrading to v4. Just out of curiosity, what issues did you face when upgrading from 3.6 to 4.2.1?

        Bernd Fehling added a comment -

        Just read the CHANGES.txt carefully. There is also a section "Upgrading from Solr 3.6".


          People

          • Assignee: Robert Muir
          • Reporter: Bernd Fehling
          • Votes: 0
          • Watchers: 3
