FillPoolAsync attempts to create connections to fill the pool up to the target PoolSize, issuing CreateNewConnectionAsync to build-out or recover the delta.
If one of these tasks fault (eg. due to some socket error during websocket handshake), then all successful connections are disposed and an exception is thrown, which can result in either:
- ConnectionPool constructor throws, so the only option is to retry creating the client.
- A fire-and-forget task for ReplaceDeadConnectionAsync fails and the connection pool cannot recover the pool size at that time. The GetAvailableConnectionAsync retry policy can kick-in here and the pool may eventually recover.
Gremlin server configurations can include a load-balancer in-front of the server nodes (Azure Cosmos DB is one example). So there are cases where faults with new connections, are transient, but successful/healthy connection attempts can be maintained without destroying them.
Given this the proposal is the following:
- Modify FillPoolAsync to handle connection failures individually, doing 'best-effort' replacement of missing connections.
- If FillPoolAsync observes an exception, still throw but do not dispose of connection tasks that were successful (and check that state of those successful connections and dispose if they are in error?).
- In ConnectionPool constructor, apply reconnect retry policy to ReplaceDeadConnectionAsync() invocation
- Retry on any Exception (since there are various IO/Socket exceptions that can be thrown here). This maintains the previous behavior where the pool must be filled in-full at construction, and throws the underlying exception to the caller.
Post construction, if GetAvailableConnection triggers a connection replacement fire-and-forget task where the task encounters network IO/handshake error, then these exceptions cannot be observed directly by the caller. Only a ServerUnavailableException is thrown if the retries are exhausted.
To allow the underlying errors to be observed by callers, a callback/event handler could be added that is invoked in the ReplaceClosedConnectionsAsync throws. This would allow for the any exceptions to be observed and collected in some manner.
Further refinement of this: Have a way of populating ServerUnavailableException inner exception with the last observed replacement exception.