The Gremlin Driver replaces the connection on a channel when it receives an exception that is an instance of IOException or CodecException (including CorruptedFrameException). When a CorruptedFrameException is thrown because the response length is greater than the maxContentLength value (32kb by default), the driver assumes the host might be unavailable and tries to replace the Connection.
If the Connection is shared among multiple requests (its pending queue holds more than one request), the other WSConnection goes stale after the connection is replaced, while still keeping the server executor threads busy.
Keeping the executor threads busy for stale connections prevents the server from picking up new tasks for subsequent requests from the request queue. Additionally, since a new connection is added in the Client, it can accept more requests, and similar errors can lead to a build-up in the request queue. When many concurrent requests get into this situation, the server becomes unresponsive to new requests.
1. Have a Gremlin Server running.
2. Connect to it using the Java driver with maxContentLength set very low, for instance using the config below:
3. Issue concurrent requests using the cluster, where each response would be greater than 32 bytes.
One possible solution is to not consider the channel dead when the response length exceeds maxContentLength.
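The first option can be sketched roughly as follows. This is only an illustration, not the driver's actual code: `CodecException` and `CorruptedFrameException` here are local stand-ins for the Netty types (so the sketch compiles without Netty on the classpath), and `ReplacementPolicy` with its two methods is a hypothetical helper. The sketch also simplifies by treating every CorruptedFrameException as non-fatal, whereas a real fix would probably need to single out the "response too large" case.

```java
import java.io.IOException;

// Local stand-ins for the Netty exception types named in the report
// (hypothetical, simplified hierarchy; not the real Netty classes).
class CodecException extends RuntimeException {
    CodecException(String message) { super(message); }
}

class CorruptedFrameException extends CodecException {
    CorruptedFrameException(String message) { super(message); }
}

// Hypothetical helper; not part of the Gremlin Driver API.
final class ReplacementPolicy {
    private ReplacementPolicy() {}

    // Behavior described in the report: any IOException or CodecException
    // is taken as a sign of a dead channel and triggers replacement.
    static boolean currentBehavior(Throwable t) {
        return t instanceof IOException || t instanceof CodecException;
    }

    // Proposed behavior: an oversized frame is a problem with a single
    // response, not with the host, so it does not trigger replacement.
    static boolean proposedBehavior(Throwable t) {
        if (t instanceof CorruptedFrameException) {
            return false;
        }
        return t instanceof IOException || t instanceof CodecException;
    }
}
```

With this split, `ReplacementPolicy.proposedBehavior(new CorruptedFrameException("frame too large"))` returns false, while an IOException or any other CodecException still leads to replacement.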
Another fix could be to remove the failed request from the Connection's pending request map and, if there are other pending requests on the connection, close them before replacing the connection, or to not replace the connection at all.
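A minimal sketch of the client-side bookkeeping this second option implies, under stated assumptions: `PendingRequests` and its methods are hypothetical names, the map is keyed by request id, and responses are modeled as plain strings. It does not reflect the driver's internal classes and says nothing about how the server-side executor threads would be released.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a connection's pending-request map; names are
// illustrative and do not match the driver's internal classes.
final class PendingRequests {
    private final Map<UUID, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Registers a request and returns the future its caller waits on.
    CompletableFuture<String> register(UUID requestId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(requestId, future);
        return future;
    }

    // Fails only the request whose response was too large and removes it
    // from the map; other requests on the connection are left untouched.
    void failOne(UUID requestId, Throwable cause) {
        CompletableFuture<String> future = pending.remove(requestId);
        if (future != null) {
            future.completeExceptionally(cause);
        }
    }

    // If the connection is going to be replaced anyway, fail every
    // remaining request first so no caller waits on a stale connection.
    void failAll(Throwable cause) {
        for (UUID id : pending.keySet()) {
            CompletableFuture<String> future = pending.remove(id);
            if (future != null) {
                future.completeExceptionally(cause);
            }
        }
    }

    // Number of requests still waiting for a response.
    int size() {
        return pending.size();
    }
}
```

In this model, the "do not replace" variant calls only `failOne` for the oversized response and keeps the connection in use, while the "replace" variant calls `failAll` before the connection is swapped out.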