Solr / SOLR-13100

harden/manage connection pool used for intra-cluster communication when we know nodes go down


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      I'm spinning this idea off of some comments I made in SOLR-13028...

      In that issue, in discussion of some test failures that can happen after a node is shut down/restarted (new emphasis added)...

      The bit where the test fails is that it (see the sketch after this list):

      1. shuts down a jetty instance
      2. starts the jetty instance again
      3. does some waiting for all the collections to be "active" and all the replicas to be "live"
      4. tries to send an auto-scaling 'set-cluster-preferences' config change to the cluster
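
      A rough sketch of those four steps, assuming the MiniSolrCloudCluster / JettySolrRunner test APIs and a V2Request to the /cluster/autoscaling endpoint; the cluster variable, the timeout, and the preferences payload are illustrative, not copied from the actual test:

      {code:java}
      import org.apache.solr.client.solrj.SolrRequest;
      import org.apache.solr.client.solrj.embedded.JettySolrRunner;
      import org.apache.solr.client.solrj.impl.CloudSolrClient;
      import org.apache.solr.client.solrj.request.V2Request;
      import org.apache.solr.cloud.MiniSolrCloudCluster;

      public class RestartThenReconfigureSketch {

        static void restartAndReconfigure(MiniSolrCloudCluster cluster) throws Exception {
          // 1. shut down a jetty instance
          JettySolrRunner jetty = cluster.getJettySolrRunner(0);
          jetty.stop();

          // 2. start the jetty instance again
          jetty.start();

          // 3. wait for the cluster to consider all nodes live again
          //    (the real test also waits for every replica to be "active")
          cluster.waitForAllNodes(30);

          // 4. send a 'set-cluster-preferences' autoscaling config change; this is
          //    the request that can end up being dispatched over a stale pooled
          //    connection to the pre-restart jetty instance
          CloudSolrClient client = cluster.getSolrClient();
          String payload = "{\"set-cluster-preferences\": [{\"minimize\": \"cores\"}]}";
          new V2Request.Builder("/cluster/autoscaling")
              .withMethod(SolrRequest.METHOD.POST)
              .withPayload(payload)
              .build()
              .process(client);
        }
      }
      {code}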

      The bit of test code where it does this creates an entirely new CloudSolrClient, ignoring the existing one except for the ZKServer address, w/an explicit comment that the reason it's doing this is because the connection pool on the existing CloudSolrClient might have a stale connection to the old (i.e. dead) instance of the restarted jetty...
      ...
      ...doing this ensures that the cloudClient doesn't try to query the "dead" server directly (on a stale connection), but IIUC this issue of stale connections to the dead server instance is still problematic - and the root cause of this failure. After the CloudSolrClient picks a random node to send the request to, that node (on the remote Solr side) then has to dispatch a request to each and every node, and at that point the node doing the distributed dispatch may also have a stale connection pool pointing at a server instance that's no longer listening.
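
      For reference, a minimal sketch of that workaround, using the SolrJ CloudSolrClient.Builder(List<String> zkHosts, Optional<String> zkChroot) constructor; the zkHost parameter is whatever ZooKeeper address the existing client was built with:

      {code:java}
      import java.util.Collections;
      import java.util.Optional;
      import org.apache.solr.client.solrj.impl.CloudSolrClient;

      public class FreshClientSketch {

        // Build a throwaway CloudSolrClient from nothing but the ZooKeeper address,
        // so its connection pool starts out empty and cannot hold a stale connection
        // to the pre-restart jetty instance.
        static CloudSolrClient freshClient(String zkHost) {
          return new CloudSolrClient.Builder(
                  Collections.singletonList(zkHost), Optional.empty())
              .build();
        }
      }
      {code}

      Note that this only protects the client-to-node hop; the node-to-node hop described above still goes through each node's own pooled connections.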

      The point of this issue is to explore if/how we can, in general, better deal with pooled connections in situations where the cluster state knows that an existing node has gone down or been restarted.

      SOLR-13028 is a particular example of when/how stale pooled connection info can cause test problems. The bulk of the discussion in that issue is about how that specific code path (the intra-cluster autoscaling handler command dispatch) can be improved to retry in the event of a NoHttpResponseException; but not every place where Solr nodes need to talk to each other can blindly retry on every possible connection exception, and even when we can, it would be better to minimize the risk of the request failing in a way that requires a retry.
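
      To make the retry idea concrete, here is a hedged sketch (not the actual SOLR-13028 patch) of that narrow kind of retry: retry exactly once when the underlying HttpClient reports NoHttpResponseException, which is what a stale pooled connection to a restarted node typically produces. This is only reasonable for requests known to be idempotent:

      {code:java}
      import java.io.IOException;
      import org.apache.http.NoHttpResponseException;
      import org.apache.solr.client.solrj.SolrClient;
      import org.apache.solr.client.solrj.SolrRequest;
      import org.apache.solr.client.solrj.SolrServerException;
      import org.apache.solr.common.util.NamedList;

      public class StaleConnectionRetrySketch {

        static NamedList<Object> requestWithOneRetry(SolrClient client, SolrRequest<?> req)
            throws SolrServerException, IOException {
          try {
            return client.request(req);
          } catch (SolrServerException e) {
            // SolrJ generally wraps the underlying HttpClient exception, so look at
            // the root cause (assumption: the wrapping is not swallowed elsewhere)
            if (e.getRootCause() instanceof NoHttpResponseException) {
              // the pooled connection was stale; a single retry checks out a fresh one
              return client.request(req);
            }
            throw e;
          }
        }
      }
      {code}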

      So why not improve our HTTP connection pool to be aware of our clusterstate and purge connections when we know nodes have been shut down/lost?
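
      As a strawman for what that might look like, here is a rough sketch under stated assumptions: it uses ZkStateReader.registerLiveNodesListener (whose exact signature varies across Solr versions), and since HttpClient 4's PoolingHttpClientConnectionManager has no per-host eviction, it coarsely closes all expired and idle pooled connections whenever any node drops out of live_nodes:

      {code:java}
      import java.util.SortedSet;
      import java.util.TreeSet;
      import java.util.concurrent.TimeUnit;
      import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
      import org.apache.solr.common.cloud.ZkStateReader;

      public class ClusterAwarePoolPurgeSketch {

        public static void install(ZkStateReader zkStateReader,
                                   PoolingHttpClientConnectionManager connectionManager) {
          zkStateReader.registerLiveNodesListener((oldLiveNodes, newLiveNodes) -> {
            // figure out which nodes were live before but are not live now
            SortedSet<String> departed = new TreeSet<>(oldLiveNodes);
            departed.removeAll(newLiveNodes);
            if (!departed.isEmpty()) {
              // coarse purge: drop every expired connection and every idle one, so no
              // request can be written to a socket a dead node has already half-closed
              connectionManager.closeExpiredConnections();
              connectionManager.closeIdleConnections(0, TimeUnit.MILLISECONDS);
            }
            return false; // keep listening (assumption: returning true deregisters)
          });
        }
      }
      {code}

      The obvious trade-off is that perfectly good idle connections to healthy nodes get dropped too; a smarter pool would evict only the routes pointing at the departed node(s).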


People

    Assignee: Unassigned
    Reporter: Chris M. Hostetter (hossman)
    Votes: 0
    Watchers: 1
