Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Looking at the flaky dashboard for the master branch, the top several UTs tend to fail at the same time. One thing the failed flaky-test jobs have in common is that their execution time is more than one hour, whereas successful executions usually take only about half an hour.
I compared the output of TestRestoreSnapshotFromClientWithRegionReplicas: in a successful run, DisableTableProcedure finishes within one second, while in a failed run it can take more than half a minute.
I'm not sure what the real problem is, but the failed runs seem to have time holes in the output, i.e., stretches of several seconds with no log output. For example:
2018-09-11 21:08:08,152 INFO [PEWorker-4] procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 12.9380sec
2018-09-11 21:08:15,590 DEBUG [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663] master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
No log output for about 7 seconds.
For a successful run, the same place in the log looks like this:
2018-09-12 07:47:32,488 INFO [PEWorker-7] procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 1.2220sec
2018-09-12 07:47:32,881 DEBUG [RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=59079] master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
There is no such hole.
Maybe a long GC pause is the cause?
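To look for such holes systematically rather than by eye, a small script can scan a test output file and report every pair of consecutive timestamped lines that are further apart than some threshold. The following is only a rough sketch: the function name `find_log_gaps` and the 5-second default threshold are assumptions made for illustration, and it assumes every log line of interest starts with a timestamp in the `yyyy-MM-dd HH:mm:ss,SSS` form seen in the excerpts above.

```python
import re
from datetime import datetime

# Leading timestamp of a log line, e.g. "2018-09-11 21:08:08,152".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_log_gaps(lines, threshold_seconds=5.0):
    """Return (gap_seconds, previous_line, next_line) for each silent interval.

    Hypothetical helper: a gap is reported when two consecutive timestamped
    lines are at least `threshold_seconds` apart.
    """
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            # Lines without a leading timestamp (e.g. stack traces) are skipped.
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev_time is not None:
            gap = (current - prev_time).total_seconds()
            if gap >= threshold_seconds:
                gaps.append((gap, prev_line, line))
        prev_time, prev_line = current, line
    return gaps
```

Running this over the output of a failed run and a successful run of the same test would show whether the holes cluster around particular procedures or are spread evenly, which may help tell a GC pause apart from a generally overloaded machine.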
Attachments
1. Print heap and gc informations in our junit ResourceChecker | Resolved | Duo Zhang
2. flaky job should gather machine stats | Resolved | Sean Busbey
3. Collect loadavg when gathering machine environment | Resolved | Unassigned
4. Increase the waiting timeout for TestProcedurePriority | Resolved | Duo Zhang
5. Split TestCloneSnapshotFromClient | Resolved | Duo Zhang
6. Split TestRestoreSnapshotFromClient | Resolved | Duo Zhang