Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
I'm spinning this idea off of some comments I made in SOLR-13028...
In that issue, in discussion of some test failures that can happen after a node is shut down/restarted (new emphasis added)...
The bit where the test fails is that it:
- shuts down a jetty instance
- starts the jetty instance again
- does some waiting for all the collections to be "active" and all the replicas to be "live"
- tries to send an autoscaling 'set-cluster-preferences' config change to the cluster
The bit of test code where it does this creates an entirely new CloudSolrClient, ignoring the existing one except for the ZK server address, w/an explicit comment that the reason it does this is that the connection pool on the existing CloudSolrClient might have a stale connection to the old (i.e. dead) instance of the restarted jetty...
...
...doing this ensures that the cloudClient doesn't try to query the "dead" server directly (on a stale connection), but IIUC this issue of stale connections to the dead server instance is still problematic - and the root cause of this failure - because after the CloudSolrClient picks a random node to send the request to, on the remote Solr side that node then has to dispatch a request to each and every node, and at that point the node doing the distributed dispatch may also have a stale connection pool pointing at a server instance that's no longer listening.
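For reference, here's a rough sketch of the "fresh client" workaround the test uses. It assumes SolrJ's CloudSolrClient.Builder and a placeholder zkHost value; the idea is simply that a client built from only the ZK address starts with an empty connection pool, so it can't reuse a stale connection to the restarted jetty:
{code:java}
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class FreshClientSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder value; the test gets the ZK address from the running cluster.
    String zkHost = "localhost:9983";

    // Build a brand new CloudSolrClient from only the ZK address, so its HTTP
    // connection pool starts empty and cannot hold a stale connection to the
    // old (dead) instance of the restarted jetty.
    try (CloudSolrClient freshClient =
             new CloudSolrClient.Builder(Collections.singletonList(zkHost), Optional.empty())
                 .build()) {
      freshClient.connect();
      // ... send the 'set-cluster-preferences' request with freshClient ...
    }
  }
}
{code}
Note this only protects the client-side hop; as described above, the node that receives the request still fans it out using its own (possibly stale) pool.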
The point of this issue is to explore if/how we can, in general, better deal with pooled connections in situations where the cluster state knows that an existing node has gone down or been restarted.
SOLR-13028 is a particular example of when/how stale pooled connection info can cause test problems, and the bulk of the discussion in that issue is about how that specific code path (dealing with an intra-cluster autoscaling handler command dispatch) can be improved to retry in the event of a NoHttpResponseException. But not every place where Solr nodes need to talk to each other can blindly retry on every possible connection exception; and even when we can, it would be better if we could minimize the risk of the request failing in a way that requires a retry.
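As a concrete illustration of that kind of narrow retry (not Solr's actual dispatch code), here is a hedged sketch using Apache HttpClient 4.x's HttpRequestRetryHandler that retries only when the server dropped the connection without responding:
{code:java}
import java.io.IOException;

import org.apache.http.NoHttpResponseException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

public class StaleConnectionRetrySketch {

  public static CloseableHttpClient buildClient() {
    // Retry at most once, and only on NoHttpResponseException, i.e. the case
    // where a stale pooled connection pointed at a server that is no longer
    // listening.  This is NOT safe to apply blindly to every request type.
    HttpRequestRetryHandler retryOnStaleConnection = new HttpRequestRetryHandler() {
      @Override
      public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        return executionCount <= 1 && exception instanceof NoHttpResponseException;
      }
    };
    return HttpClients.custom()
        .setRetryHandler(retryOnStaleConnection)
        .build();
  }
}
{code}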
So why not improve our HTTP connection pool to be aware of our clusterstate and purge connections when we know nodes have been shut down/lost?
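As a rough sketch of what that could look like, assuming the node's clients pool connections in an Apache PoolingHttpClientConnectionManager and that we can observe old/new live-node sets from the ZK live_nodes watch (the wiring to ZkStateReader is assumed, not shown): when nodes drop out of live_nodes, evict idle pooled connections so the next request opens a fresh socket instead of reusing a stale one. HttpClient 4.x has no public per-route eviction, so this "close all idle connections" version is deliberately blunt:
{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/**
 * Illustration only: purge pooled connections when the cluster state says
 * nodes have gone away.  Hooking this up to the live_nodes watch is assumed.
 */
public class LiveNodesAwarePoolSketch {

  private final PoolingHttpClientConnectionManager connectionManager;

  public LiveNodesAwarePoolSketch(PoolingHttpClientConnectionManager connectionManager) {
    this.connectionManager = connectionManager;
  }

  /** Called with the previous and current contents of /live_nodes. */
  public void onLiveNodesChanged(SortedSet<String> oldLiveNodes, SortedSet<String> newLiveNodes) {
    Set<String> lostNodes = new HashSet<>(oldLiveNodes);
    lostNodes.removeAll(newLiveNodes);
    if (!lostNodes.isEmpty()) {
      // Some node we may have pooled connections to is gone (or restarted).
      // Close expired and idle connections so the next request to any node
      // opens a new socket rather than reusing one that points at a server
      // instance that is no longer listening.
      connectionManager.closeExpiredConnections();
      connectionManager.closeIdleConnections(0, TimeUnit.MILLISECONDS);
    }
  }
}
{code}
A real implementation would presumably want per-route eviction so connections to healthy nodes aren't thrown away, but that needs deeper integration with the pool than HttpClient's public API offers.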
Attachments
Issue Links
- relates to SOLR-13028 Harden AutoAddReplicasPlanActionTest#testSimple (Resolved)
- relates to SOLR-13038 Overseer actions fail with NoHttpResponseException following a node restart (Open)
- relates to SOLR-6944 ReplicationFactorTest and HttpPartitionTest both fail with org.apache.http.NoHttpResponseException: The target server failed to respond (Open)