Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.6.0, 3.6.3, 3.6.1, 3.6.2, 3.6.4
-
None
-
zookeeper version: 3.6.4
kernel: 3.10.0-1160.95.1.el7.x86_64
java version "1.8.0_111"
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
Description
In our production, we use zookeeper built-in PrometheusMetricsProvider to monitor zookeeper status, recently we observed very high latency in one of our zookeeper cluster which serve heavy load.
Measured in a heavy client side, the latency could more than 25 seconds.
[arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest Press Q or Ctrl+C to abort. Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807315 xid:589796 sent:589796 recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], @ReplyHeader[585397,146177807315,0 ], ] method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807862 xid:589989 sent:589989 recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], @ReplyHeader[585588,146177807862,0 ]
Observed many connections with high Recv-Q on the server side.
CommitProcWorkThread BLOCKED in org.apache.zookeeper.server.ServerStats#updateLatency:
"CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x00007f86d804a000 nid=0x6bca waiting for monitor entry [0x00007f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor) at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) - waiting to lock <0x0000000784dd1a18> (a io.prometheus.client.CKMSQuantiles) at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) at io.prometheus.client.Summary$Child.observe(Summary.java:281) at io.prometheus.client.Summary.observe(Summary.java:307) at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x0000000734923e80> (a java.util.concurrent.ThreadPoolExecutor$Worker)
The wall clock profile shows that there is lock contention within `CommitProcWorkThread` threads.
Attachments
Attachments
Issue Links
- links to