Uploaded image for project: 'TinkerPop'
  1. TinkerPop
  2. TINKERPOP-814

ConnectionPool can fill with dead Connections

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1-incubating
    • Component/s: driver
    • Labels:
      None
    • Environment:
      "Cluster" of a single gremlin-server/Titan node. Single client. minConnectionPoolSize = 2; maxConnectionPoolSize = 10.

      Description

      Client gets a GryoMessageSerializerV1d0 decoding error on a couple of requests (different unfiled issue). ConnectionPool gets into the situation where there are 2 dead connections being maintained in the pool, but no new connections are being issued.

      A sample of the log output...

      2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection from pool on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
      2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Return least used Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=1, pending=0} on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, requestId=0349c1ce-19dc-4d54-9ca9-30965c9314ae, op='eval', processor='', args={<redacted>}} to - Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}}
      2015-08-27 09:57:18,125 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Attempting to return Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}} on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - On Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} pool size of 3 > minPoolSize 2 and borrowed of 0 <= minSimultaneousUsagePerConnection 8 so destroy Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0}
      2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0} destroyed
      2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection from pool on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
      2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Pool was initialized but a connection could not be selected earlier - waiting for connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Wait 3000 MILLISECONDS for an available connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} with Thread[Grizzly-worker(15),5,main]
      2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Continue to wait for connection on -1 if {} > 0
      2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Timed-out waiting for connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - possibly unavailable
      2015-08-27 09:57:21,127 [gremlin-driver-worker-3] DEBUG org.apache.tinkerpop.gremlin.driver.Host  - Trying to reconnect to dead host at Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:21,128 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, requestId=67761973-7ae7-4a6a-8e62-443fdb55142d, op='eval', processor='', args=<redacted>} to - connection not initialized
      2015-08-27 09:57:21,128 [gremlin-driver-worker-3] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Trying to re-establish connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:21,128 [Grizzly-worker(15)] ERROR EdnaMetaDataServiceGremlinImpl  - Failed man lookupMerchantAccountName
      java.lang.RuntimeException: java.lang.RuntimeException: java.util.concurrent.TimeoutException
      

      This behavior continues with each subsequent request. The gremlin-server is available at all times.

      As a workaround I have applied the following patch, but I am pretty sure it is not the correct solution

      diff --git a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
      index 30cd71f..cb8ebfa 100644
      --- a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
      +++ b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
      @@ -352,7 +352,7 @@ final class ConnectionPool {
                   }
      
                   remaining = to - TimeUtil.timeSince(start, unit);
      -            logger.debug("Continue to wait for connection on {} if {} > 0", remaining);
      +            logger.debug("Continue to wait for connection on {} if {} > 0", host, remaining);
               } while (remaining > 0);
      
               logger.debug("Timed-out waiting for connection on {} - possibly unavailable", host);
      @@ -371,13 +371,16 @@ final class ConnectionPool {
      
               // let the load-balancer know that the host is acting poorly
               this.cluster.loadBalancingStrategy().onUnavailable(host);
      -
           }
      
           private boolean tryReconnect(final Host h) {
               logger.debug("Trying to re-establish connection on {}", host);
      
               try {
      +            // remove dead connections
      +            for (Connection c : connections)
      +                if (c.isDead())
      +                    definitelyDestroyConnection(c);
                   connections.add(new Connection(host.getHostUri(), this, cluster, settings().maxInProcessPerConnection));
                   this.open.set(connections.size());
      

        Issue Links

          Activity

          Hide
          spmallette stephen mallette added a comment -

          Thanks for reporting this. Is there an "easy" way to get the driver into this state? You mentioned the "gryo" serialization failure...anything I can easily/readily reproduce?

          Show
          spmallette stephen mallette added a comment - Thanks for reporting this. Is there an "easy" way to get the driver into this state? You mentioned the "gryo" serialization failure...anything I can easily/readily reproduce?
          Hide
          kieransherlock Kieran Sherlock added a comment -

          Hmm. I was thinking that it was any error on the connection, e.g. a client gremlin script syntax error, but that is not the case.

          The client error I'm getting (when accessing a particular V in my DB) is

          Caused by: org.apache.tinkerpop.gremlin.driver.ser.SerializationException: java.lang.IndexOutOfBoundsException: Index: 82125230, Size: 0
          	at org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0.deserializeResponse(GryoMessageSerializerV1d0.java:183)
          	at org.apache.tinkerpop.gremlin.driver.handler.WebSocketGremlinResponseDecoder.decode(WebSocketGremlinResponseDecoder.java:49)
          	at org.apache.tinkerpop.gremlin.driver.handler.WebSocketGremlinResponseDecoder.decode(WebSocketGremlinResponseDecoder.java:36)
          	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
          

          I tried adding this to the driver....

          diff --git a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java
          index 1c63046..3f5b812 100644
          --- a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java
          +++ b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java
          @@ -170,6 +170,11 @@ public final class GryoMessageSerializerV1d0 implements MessageSerializer {
                           final Object result = kryo.readClassAndObject(input);
                           final Map<String,Object> metaAttributes = (Map<String,Object>) kryo.readClassAndObject(input);
          
          +                if (true) {
          +                    logger.warn("Throwing indexOOBE for TINKERPOP3-814");
          +                    throw new IndexOutOfBoundsException("TINKERPOP3-814");
          +                }
          +
                           return ResponseMessage.build(requestId)
                                   .code(ResponseStatusCode.getFromValue(status))
                                   .statusMessage(statusMsg.intern())
          

          and then running a test like

              @Test
              public void driverbug3() throws Exception {
                  GryoMapper kryo = GryoMapper.build().addRegistry(TitanIoRegistry.INSTANCE).create();
                  MessageSerializer serializer = new GryoMessageSerializerV1d0(kryo);
                  Client client = Cluster.build("localhost")
                          .port(18182)
                          .maxConnectionPoolSize(10)
                          .reconnectIntialDelay(0)
                          .workerPoolSize(10)
                          .serializer(serializer)
                          .create().connect();
          
                  Map<String, Object> params = new HashMap<>();
                  params.put("id", 1);
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                  Thread.sleep(500);
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                  Thread.sleep(0);
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                  Thread.sleep(0);
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                  Thread.sleep(0);
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                  Thread.sleep(0);
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                  Thread.sleep(2000);
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                  try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
              }
          

          BUT I am NOT seeing the same Timeout due to

          2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Wait 3000 MILLISECONDS for an available connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} with Thread[Grizzly-worker(15),5,main]
          

          I'll try some more and let you know if I come up with something.

          Show
          kieransherlock Kieran Sherlock added a comment - Hmm. I was thinking that it was any error on the connection, e.g. a client gremlin script syntax error, but that is not the case. The client error I'm getting (when accessing a particular V in my DB) is Caused by: org.apache.tinkerpop.gremlin.driver.ser.SerializationException: java.lang.IndexOutOfBoundsException: Index: 82125230, Size: 0 at org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0.deserializeResponse(GryoMessageSerializerV1d0.java:183) at org.apache.tinkerpop.gremlin.driver.handler.WebSocketGremlinResponseDecoder.decode(WebSocketGremlinResponseDecoder.java:49) at org.apache.tinkerpop.gremlin.driver.handler.WebSocketGremlinResponseDecoder.decode(WebSocketGremlinResponseDecoder.java:36) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89) I tried adding this to the driver.... diff --git a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java index 1c63046..3f5b812 100644 --- a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java +++ b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java @@ -170,6 +170,11 @@ public final class GryoMessageSerializerV1d0 implements MessageSerializer { final Object result = kryo.readClassAndObject(input); final Map<String,Object> metaAttributes = (Map<String,Object>) kryo.readClassAndObject(input); + if (true) { + logger.warn("Throwing indexOOBE for TINKERPOP3-814"); + throw new IndexOutOfBoundsException("TINKERPOP3-814"); + } + return ResponseMessage.build(requestId) .code(ResponseStatusCode.getFromValue(status)) .statusMessage(statusMsg.intern()) and then running a test like @Test public void driverbug3() throws Exception { GryoMapper kryo = GryoMapper.build().addRegistry(TitanIoRegistry.INSTANCE).create(); MessageSerializer serializer = new GryoMessageSerializerV1d0(kryo); Client client = Cluster.build("localhost") .port(18182) .maxConnectionPoolSize(10) .reconnectIntialDelay(0) .workerPoolSize(10) .serializer(serializer) .create().connect(); Map<String, Object> params = new HashMap<>(); params.put("id", 1); try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {}; Thread.sleep(500); try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {}; Thread.sleep(0); try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {}; Thread.sleep(0); try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {}; Thread.sleep(0); try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {}; Thread.sleep(0); try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {}; Thread.sleep(2000); try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();}; try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();}; try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();}; try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();}; try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();}; try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();}; } BUT I am NOT seeing the same Timeout due to 2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool - Wait 3000 MILLISECONDS for an available connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} with Thread[Grizzly-worker(15),5,main] I'll try some more and let you know if I come up with something.
          Hide
          gereedy Geoff Reedy added a comment -

          Here are two issues that can cause protocol synchronization issues and lead to dead/broken connections.

          Show
          gereedy Geoff Reedy added a comment - Here are two issues that can cause protocol synchronization issues and lead to dead/broken connections.
          Hide
          spmallette stephen mallette added a comment -

          The solution in the description wasn't too far off from the heart of the problem in some sense. First of all, the Host was getting marked unavailable too early for issues that might have been repairable in a sense. For example, a serialization problem might occur for one query but not for another (e.g. one query might return a Titan RelationIdentifier which requires a serializer and another might not). These cases where there is a chance that future requests can be made on the Host it's probably not a good idea to kill the channel and mark the Host dead. So that was resolved TINKERPOP3-816

          Then, when the Host was marked unavailable, nothing was being done to clear the connections in the pool and then (for some reason) the driver wasn't doing much to respect the pool size and new connections were just being created and added to the pool when a reconnect to the Host was attempted. So, the driver now destroys connections when the Host is marked as unavailable. On reconnect attempts it looks to create a new connection (through standard methods that respect pool size/timeouts/etc) and then issues a "ping" to see if the host is online. The "ping" is kinda dumpy, just a gremlin script with an empty string, which should just evaluate back as an empty string, but it does the job. Unfortunately, it's not easy rely on the websocket ping given the way the driver is setup now, plus we have to rely on the subprotcol capabilities since this code will need to work over the NIO channel or other means so doing websocket specific things won't be so good. Perhaps we could build a "ping" built into the subprotocol itself.

          It would be nice if I had a clean unit test here to prevent regressions, but I'm not completely sure how to write it - our integration testing infrastructure isn't advanced enough for the scenario involved i don't think.

          Show
          spmallette stephen mallette added a comment - The solution in the description wasn't too far off from the heart of the problem in some sense. First of all, the Host was getting marked unavailable too early for issues that might have been repairable in a sense. For example, a serialization problem might occur for one query but not for another (e.g. one query might return a Titan RelationIdentifier which requires a serializer and another might not). These cases where there is a chance that future requests can be made on the Host it's probably not a good idea to kill the channel and mark the Host dead. So that was resolved TINKERPOP3-816 Then, when the Host was marked unavailable, nothing was being done to clear the connections in the pool and then (for some reason) the driver wasn't doing much to respect the pool size and new connections were just being created and added to the pool when a reconnect to the Host was attempted. So, the driver now destroys connections when the Host is marked as unavailable. On reconnect attempts it looks to create a new connection (through standard methods that respect pool size/timeouts/etc) and then issues a "ping" to see if the host is online. The "ping" is kinda dumpy, just a gremlin script with an empty string, which should just evaluate back as an empty string, but it does the job. Unfortunately, it's not easy rely on the websocket ping given the way the driver is setup now, plus we have to rely on the subprotcol capabilities since this code will need to work over the NIO channel or other means so doing websocket specific things won't be so good. Perhaps we could build a "ping" built into the subprotocol itself. It would be nice if I had a clean unit test here to prevent regressions, but I'm not completely sure how to write it - our integration testing infrastructure isn't advanced enough for the scenario involved i don't think.

            People

            • Assignee:
              spmallette stephen mallette
              Reporter:
              kieransherlock Kieran Sherlock
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development