TINKERPOP-2769: gremlin-server does not reply with a timeout response to all timed out requests


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6.0, 3.5.3, 3.6.1, 3.5.4
    • Fix Version/s: 3.7.0, 3.6.2, 3.5.5
    • Component/s: server
    • Labels: None

    Description

      The following reproducer hangs because not all timed out requests are reported by gremlin-server.

      main.go

      package main
      
      import (
      	"log"
      	"sync"
      	"sync/atomic"
      
      	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
      )
      
      const (
      	gremlinServerEndpoint = "ws://localhost:8182/gremlin"
      	nWorkers              = 250
      )
      
      func main() {
      	conn, err := gremlingo.NewDriverRemoteConnection(gremlinServerEndpoint, func(settings *gremlingo.DriverRemoteConnectionSettings) {
      		settings.LogVerbosity = gremlingo.Off
      	})
      	if err != nil {
      		log.Fatalf("could not create remote connection: %v", err)
      	}
      	defer conn.Close()
      
      	g := gremlingo.Traversal_().WithRemote(conn)
      
      	var (
      		wg               sync.WaitGroup
      		nPendingRequests int32
      	)
      
      	for i := 0; i < nWorkers; i++ {
      		wg.Add(1)
      		go func(id int) {
      			defer wg.Done()
      			atomic.AddInt32(&nPendingRequests, 1)
      			result, err := g.
      				Inject(1).
      				SideEffect(&gremlingo.Lambda{Script: "Thread.sleep(5_000)", Language: "gremlin-groovy"}).
      				Next()
      			npr := atomic.AddInt32(&nPendingRequests, -1)
      			log.Printf("[%v] result=%v err=%v nPendingRequests=%v", id, result, err, npr)
      		}(i)
      	}
      
      	log.Print("waiting")
      	wg.Wait()
      	log.Print("done")
      }
      

      Steps to reproduce the issue:

      1. Run gremlin-server:

      docker run --rm -ti -p 8182:8182 tinkerpop/gremlin-server:3.5.4-SNAPSHOT 2>&1 | tee log
      

      2. Execute the reproducer:

      go run main.go
      

      3. Wait ~30s (roughly the default evaluationTimeout of 30000 ms) for the reproducer to hang.

      All these tests have been performed against the default gremlin-server configuration. Note that on this machine the gremlin pool is set to 8; higher values may require adjusting the number of workers in the reproducer to trigger the issue.
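
      For reference, both settings involved live in the server config (conf/gremlin-server.yaml); a minimal sketch, where the values shown are assumptions matching this machine rather than necessarily the shipped defaults:

      gremlinPool: 8            # threads evaluating traversals; 0 means "use the CPU count"
      evaluationTimeout: 30000  # per-request evaluation timeout, in milliseconds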

      After following these steps, the output of the reproducer should look like this:

      2022/06/29 16:08:47 waiting
      2022/06/29 16:08:52 [22] result=result{object=1 class=int64} err=<nil> nPendingRequests=249
      ...
      2022/06/29 16:09:17 [154] result=result{object=1 class=int64} err=<nil> nPendingRequests=208
      2022/06/29 16:09:17 [243] result=result{object=1 class=int64} err=<nil> nPendingRequests=207
      2022/06/29 16:09:17 [226] result=<nil> err=E0903: there are no results left nPendingRequests=206
      2022/06/29 16:09:17 [164] result=<nil> err=E0903: there are no results left nPendingRequests=204
      ...
      2022/06/29 16:09:17 [131] result=<nil> err=E0903: there are no results left nPendingRequests=113
      

      On the server side, the gremlin-server logs should look like this:

      [INFO] ? - Channel started at port 8182.
      [WARN] ? - A timeout occurred during traversal evaluation of [RequestMessage{, requestId=9d7ccc61-0e7e-4105-ae35-eb81a3895170, op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider increasing the limit given to evaluationTimeout
      [WARN] ? - A timeout occurred during traversal evaluation of [RequestMessage{, requestId=6f2d939a-31c9-48fd-bb54-b2bfa1fbd4de, op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider increasing the limit given to evaluationTimeout
      ...
      

      In this specific case,

      • 43 requests finished successfully (250-207 = 43)
      • 94 requests timed out (207-113 = 94)
      • 113 requests never finished (making the reproducer hang)

      The number of timeouts reported in the gremlin-server logs matches the number of timeout errors received by the reproducer, which makes me think that gremlin-server is the culprit: the 113 requests that hang were neither answered nor logged as timed out.

      $ grep 'A timeout occurred' log | wc -l
      94
      

      Inspecting the hanging goroutines in the reproducer shows that they are parked in IO wait:

      (dlv) grs
      ...
        Goroutine 373 - User: /home/n/goroot/src/net/fd_posix.go:55 net.(*netFD).Read (0x684273) [IO wait]
      ...
      (dlv) gr 373
      Switched from 0 to 373 (thread 709858)
      (dlv) bt
       0  0x000000000043d212 in runtime.gopark
          at /home/n/goroot/src/runtime/proc.go:362
       1  0x0000000000435a4a in runtime.netpollblock
          at /home/n/goroot/src/runtime/netpoll.go:522
       2  0x0000000000465e05 in internal/poll.runtime_pollWait
          at /home/n/goroot/src/runtime/netpoll.go:302
       3  0x00000000004ff2c8 in internal/poll.(*pollDesc).wait
          at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:83
       4  0x00000000004ff377 in internal/poll.(*pollDesc).waitRead
          at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:88
       5  0x0000000000500f8f in internal/poll.(*FD).Read
          at /home/n/goroot/src/internal/poll/fd_unix.go:167
       6  0x0000000000684273 in net.(*netFD).Read
          at /home/n/goroot/src/net/fd_posix.go:55
       7  0x000000000069ce49 in net.(*conn).Read
          at /home/n/goroot/src/net/net.go:183
       8  0x00000000005d9eb4 in bufio.(*Reader).fill
          at /home/n/goroot/src/bufio/bufio.go:106
       9  0x00000000005da278 in bufio.(*Reader).Peek
          at /home/n/goroot/src/bufio/bufio.go:144
      10  0x00000000008298a9 in github.com/gorilla/websocket.(*Conn).read
          at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:371
      11  0x000000000082d5fd in github.com/gorilla/websocket.(*Conn).advanceFrame
          at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:809
      12  0x000000000082f1d1 in github.com/gorilla/websocket.(*Conn).NextReader
          at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:1009
      13  0x000000000082fccb in github.com/gorilla/websocket.(*Conn).ReadMessage
          at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:1093
      14  0x00000000008418fb in github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gorillaTransporter).Read
          at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/gorillaTransporter.go:116
      15  0x000000000085a302 in github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gremlinServerWSProtocol).readLoop
          at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/protocol.go:57
      16  0x000000000085c2e7 in github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol.func1
          at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/protocol.go:196
      17  0x000000000046a841 in runtime.goexit
          at /home/n/goroot/src/runtime/asm_amd64.s:1571
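
      The goroutine listing and backtrace above can be captured by attaching delve to the running reproducer (assumption: dlv is installed and the reproducer's pid is known):

      dlv attach <reproducer-pid>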
      

      The reason the Go program never finishes is that gremlin-server keeps responding to pings, so gremlin-go's pong handler keeps extending the connection's read deadline; the blocked read in the stack above therefore never times out.

      /tinkerpop/gremlin-go/driver/gorillaTransporter.go

      	transporter.connection.SetPongHandler(func(string) error {
      		err := transporter.connection.SetReadDeadline(time.Now().Add(2 * transporter.connSettings.keepAliveInterval))
      		if err != nil {
      			return err
      		}
      		return nil
      	})
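
      While the fix belongs in gremlin-server, a client-side guard makes the hang visible instead of silent. The helper below is a sketch, not part of the gremlin-go API (the name nextWithTimeout and the timeout budget are assumptions); added to main.go along with "fmt" and "time" imports, it bounds the wait for a single traversal:

      // nextWithTimeout is a hypothetical helper, not part of the gremlin-go
      // API: it runs t.Next() in its own goroutine and gives up after d. It
      // does not cancel the server-side request; a lost response leaks the
      // goroutine until the connection is closed, but the caller gets an
      // error instead of hanging.
      func nextWithTimeout(t *gremlingo.GraphTraversal, d time.Duration) (*gremlingo.Result, error) {
      	type outcome struct {
      		result *gremlingo.Result
      		err    error
      	}
      	ch := make(chan outcome, 1) // buffered so the goroutine can exit even after we stop waiting
      	go func() {
      		r, err := t.Next()
      		ch <- outcome{r, err}
      	}()
      	select {
      	case o := <-ch:
      		return o.result, o.err
      	case <-time.After(d):
      		return nil, fmt.Errorf("no response from server after %v", d)
      	}
      }

      Wrapping the bare Next() call in the worker with nextWithTimeout does not unblock the driver's read loop; it only lets each worker return with an explicit error, so the reproducer can report the lost responses instead of hanging.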
      


    People

      Assignee: spmallette Stephen Mallette
      Reporter: jroimartin Roi Martin
      Votes: 0
      Watchers: 3
