Details
Description
The following reproducer hangs because not all timed out requests are reported by gremlin-server.
main.go
package main import ( "log" "sync" "sync/atomic" gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver" ) const ( gremlinServerEndpoint = "ws://localhost:8182/gremlin" nWorkers = 250 ) func main() { conn, err := gremlingo.NewDriverRemoteConnection(gremlinServerEndpoint, func(settings *gremlingo.DriverRemoteConnectionSettings) { settings.LogVerbosity = gremlingo.Off }) if err != nil { log.Fatalf("could not create remote connection: %v", err) } defer conn.Close() g := gremlingo.Traversal_().WithRemote(conn) var ( wg sync.WaitGroup nPendingRequests int32 ) for i := 0; i < nWorkers; i++ { wg.Add(1) go func(id int) { defer wg.Done() atomic.AddInt32(&nPendingRequests, 1) result, err := g. Inject(1). SideEffect(&gremlingo.Lambda{"Thread.sleep(5_000)", "gremlin-groovy"}). Next() npr := atomic.AddInt32(&nPendingRequests, -1) log.Printf("[%v] result=%v err=%v nPendingRequests=%v", id, result, err, npr) }(i) } log.Print("waiting") wg.Wait() log.Print("done") }
Steps to reproduce the issue:
1. Run gremlin-server:
docker run --rm -ti -p 8182:8182 tinkerpop/gremlin-server:3.5.4-SNAPSHOT 2>&1 | tee log
2. Execute the reproducer:
go run main.go
3. Wait ~30s for the reproducer to hang.
All these tests has been performed against the default gremlin-server configuration. It is important to note that in this machine the gremlin pool is set to 8. Higher values may require to adjust the number of workers in the reproducer to trigger the issue.
After following these steps, the output of the reproducer should look like this:
2022/06/29 16:08:47 waiting 2022/06/29 16:08:52 [22] result=result{object=1 class=int64} err=<nil> nPendingRequests=249 ... 2022/06/29 16:09:17 [154] result=result{object=1 class=int64} err=<nil> nPendingRequests=208 2022/06/29 16:09:17 [243] result=result{object=1 class=int64} err=<nil> nPendingRequests=207 2022/06/29 16:09:17 [226] result=<nil> err=E0903: there are no results left nPendingRequests=206 2022/06/29 16:09:17 [164] result=<nil> err=E0903: there are no results left nPendingRequests=204 ... 2022/06/29 16:09:17 [131] result=<nil> err=E0903: there are no results left nPendingRequests=113
On the other side, the gremlin-server logs should look like this:
[INFO] ? - Channel started at port 8182. [WARN] ? - A timeout occurred during traversal evaluation of [RequestMessage{, requestId=9d7ccc61-0e7e-4105-ae35-eb81a3895170, op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider increasing the limit given to evaluationTimeout [WARN] ? - A timeout occurred during traversal evaluation of [RequestMessage{, requestId=6f2d939a-31c9-48fd-bb54-b2bfa1fbd4de, op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider increasing the limit given to evaluationTimeout ...
In this specific case,
- 43 requests finished successfully (250-207 = 43)
- 94 requests timed out (207-113 = 94)
- 113 requests never finished (making the reproducer hang)
If we check the gremlin-server logs, the number of reported timeouts matches the output of the reproducer, which makes me think that gremlin-server is the culprit.
$ grep 'A timeout occurred' log | wc -l 94
Inspecting the hanging goroutines in the reproducer shows that they are IO waiting:
(dlv) grs ... Goroutine 373 - User: /home/n/goroot/src/net/fd_posix.go:55 net.(*netFD).Read (0x684273) [IO wait] ... (dlv) gr 373 Switched from 0 to 373 (thread 709858) (dlv) bt 0 0x000000000043d212 in runtime.gopark at /home/n/goroot/src/runtime/proc.go:362 1 0x0000000000435a4a in runtime.netpollblock at /home/n/goroot/src/runtime/netpoll.go:522 2 0x0000000000465e05 in internal/poll.runtime_pollWait at /home/n/goroot/src/runtime/netpoll.go:302 3 0x00000000004ff2c8 in internal/poll.(*pollDesc).wait at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:83 4 0x00000000004ff377 in internal/poll.(*pollDesc).waitRead at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:88 5 0x0000000000500f8f in internal/poll.(*FD).Read at /home/n/goroot/src/internal/poll/fd_unix.go:167 6 0x0000000000684273 in net.(*netFD).Read at /home/n/goroot/src/net/fd_posix.go:55 7 0x000000000069ce49 in net.(*conn).Read at /home/n/goroot/src/net/net.go:183 8 0x00000000005d9eb4 in bufio.(*Reader).fill at /home/n/goroot/src/bufio/bufio.go:106 9 0x00000000005da278 in bufio.(*Reader).Peek at /home/n/goroot/src/bufio/bufio.go:144 10 0x00000000008298a9 in github.com/gorilla/websocket.(*Conn).read at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:371 11 0x000000000082d5fd in github.com/gorilla/websocket.(*Conn).advanceFrame at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:809 12 0x000000000082f1d1 in github.com/gorilla/websocket.(*Conn).NextReader at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:1009 13 0x000000000082fccb in github.com/gorilla/websocket.(*Conn).ReadMessage at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:1093 14 0x00000000008418fb in github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gorillaTransporter).Read at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/gorillaTransporter.go:116 15 0x000000000085a302 in github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gremlinServerWSProtocol).readLoop at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/protocol.go:57 16 0x000000000085c2e7 in github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol.func1 at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/protocol.go:196 17 0x000000000046a841 in runtime.goexit at /home/n/goroot/src/runtime/asm_amd64.s:1571
The reason why the go program does not finish is that gremlin-server keeps responding to pings, therefore gremlin-go's pong handler keeps extending the read deadline of the connection.
/tinkerpop/gremlin-go/driver/gorillaTransporter.go
transporter.connection.SetPongHandler(func(string) error { err := transporter.connection.SetReadDeadline(time.Now().Add(2 * transporter.connSettings.keepAliveInterval)) if err != nil { return err } return nil })