Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
I'm spinning this idea off of some comments I made in SOLR-13028...
In that issue, in discussion of some test failures that can happen after a node is shut down/restarted (new emphasis added)...
The bit where the test fails is that it:
- shuts down a jetty instance
- starts the jetty instance again
- does some waiting for all the collections to be "active" and all the replicas to be "live"
- tries to send an autoscaling 'set-cluster-preferences' config change to the cluster
The bit of test code where it does this creates an entirely new CloudSolrClient, ignoring the existing one except for the ZK server address, w/an explicit comment that the reason it does this is that the connection pool on the existing CloudSolrClient might have a stale connection to the old (i.e. dead) instance of the restarted jetty...
...
...doing this ensures that the cloudClient doesn't try to query the "dead" server directly (on a stale connection), but IIUC this issue of stale connections to the dead server instance is still problematic - and the root cause of this failure - because after the CloudSolrClient picks a random node to send the request to, on the remote Solr side that node then has to dispatch a request to each and every node, and at that point the node doing the distributed dispatch may also have a stale connection pool pointing at a server instance that's no longer listening.
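For reference, here's a rough sketch of the "fresh client" workaround the test uses. It assumes SolrJ's CloudSolrClient.Builder and a placeholder zkHost value; the idea is simply that a client built from only the ZK address starts with an empty connection pool, so it can't reuse a stale connection to the restarted jetty:
{code:java}
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class FreshClientSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder value; the test gets the ZK address from the running cluster.
    String zkHost = "localhost:9983";

    // Build a brand new CloudSolrClient from only the ZK address, so its HTTP
    // connection pool starts empty and cannot hold a stale connection to the
    // old (dead) instance of the restarted jetty.
    try (CloudSolrClient freshClient =
             new CloudSolrClient.Builder(Collections.singletonList(zkHost), Optional.empty())
                 .build()) {
      freshClient.connect();
      // ... send the 'set-cluster-preferences' request with freshClient ...
    }
  }
}
{code}
Note this only protects the client-side hop; as described above, the node that receives the request still fans it out using its own (possibly stale) pool.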
The point of this issue is to explore if/how we can, in general, better deal with pooled connections in situations where the cluster state knows that an existing node has gone down or been restarted.
SOLR-13028 is a particular example of when/how stale pooled connection info can cause test problems, and the bulk of the discussion in that issue is about how that specific code path (dealing with an intra-cluster autoscaling handler command dispatch) can be improved to retry in the event of a NoHttpResponseException. But not every place where Solr nodes need to talk to each other can blindly retry on every possible connection exception; and even when we can, it would be better if we could minimize the risk of the request failing in a way that requires a retry.
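As a concrete illustration of that kind of narrow retry (not Solr's actual dispatch code), here is a hedged sketch using Apache HttpClient 4.x's HttpRequestRetryHandler that retries only when the server dropped the connection without responding:
{code:java}
import java.io.IOException;

import org.apache.http.NoHttpResponseException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

public class StaleConnectionRetrySketch {

  public static CloseableHttpClient buildClient() {
    // Retry at most once, and only on NoHttpResponseException, i.e. the case
    // where a stale pooled connection pointed at a server that is no longer
    // listening.  This is NOT safe to apply blindly to every request type.
    HttpRequestRetryHandler retryOnStaleConnection = new HttpRequestRetryHandler() {
      @Override
      public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        return executionCount <= 1 && exception instanceof NoHttpResponseException;
      }
    };
    return HttpClients.custom()
        .setRetryHandler(retryOnStaleConnection)
        .build();
  }
}
{code}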
So why not improve our HTTP connection pool to be aware of our clusterstate and purge connections when we know nodes have been shut down/lost?
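As a rough sketch of what that could look like, assuming the node's clients pool connections in an Apache PoolingHttpClientConnectionManager and that we can observe old/new live-node sets from the ZK live_nodes watch (the wiring to ZkStateReader is assumed, not shown): when nodes drop out of live_nodes, evict idle pooled connections so the next request opens a fresh socket instead of reusing a stale one. HttpClient 4.x has no public per-route eviction, so this "close all idle connections" version is deliberately blunt:
{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/**
 * Illustration only: purge pooled connections when the cluster state says
 * nodes have gone away.  Hooking this up to the live_nodes watch is assumed.
 */
public class LiveNodesAwarePoolSketch {

  private final PoolingHttpClientConnectionManager connectionManager;

  public LiveNodesAwarePoolSketch(PoolingHttpClientConnectionManager connectionManager) {
    this.connectionManager = connectionManager;
  }

  /** Called with the previous and current contents of /live_nodes. */
  public void onLiveNodesChanged(SortedSet<String> oldLiveNodes, SortedSet<String> newLiveNodes) {
    Set<String> lostNodes = new HashSet<>(oldLiveNodes);
    lostNodes.removeAll(newLiveNodes);
    if (!lostNodes.isEmpty()) {
      // Some node we may have pooled connections to is gone (or restarted).
      // Close expired and idle connections so the next request to any node
      // opens a new socket rather than reusing one that points at a server
      // instance that is no longer listening.
      connectionManager.closeExpiredConnections();
      connectionManager.closeIdleConnections(0, TimeUnit.MILLISECONDS);
    }
  }
}
{code}
A real implementation would presumably want per-route eviction so connections to healthy nodes aren't thrown away, but that needs deeper integration with the pool than HttpClient's public API offers.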
Attachments
Issue Links
- relates to SOLR-13028 Harden AutoAddReplicasPlanActionTest#testSimple (Resolved)
- relates to SOLR-13038 Overseer actions fail with NoHttpResponseException following a node restart (Open)
- relates to SOLR-6944 ReplicationFactorTest and HttpPartitionTest both fail with org.apache.http.NoHttpResponseException: The target server failed to respond (Open)