Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
4.2.0
-
None
-
None
-
operating system: CentOS6.0
hardware: 8C8G VM
version: rocketmq-4.2.0
broker config: default 2m-2s-async config
producer qps: 4000
pre message byte: 10 bytes
Description
When I did a performance test for broker,I found large numbers of exception like this:
send error com.alibaba.rocketmq.client.exception.MQBrokerException: CODE: 2 DESC: [REJECTREQUEST]system busy, start flow control for a while
After I have read the related source code,I think this exception occurs when the broker think its OS is busy, it will refuse the increasing request. But I saw the server's monitor, the JVM gc is normal(normally 10ms), max cost 50ms and performed at a frequency of 10s once, and my server's CPU and disk IO also is health.But my server's NTP offset occasionally drift more than 2s.So I think the pyhsical time caused the flow control.
related code:
CommitLog.java public PutMessageResult putMessage(final MessageExtBrokerInner msg) { ... ... long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now(); //now() is System.currentTimeMillis() this.beginTimeInLock = beginLockTimestamp; ... ... } DefaultMessageStore.java public boolean isOSPageCacheBusy() { long begin = this.getCommitLog().getBeginTimeInLock(); long diff = this.systemClock.now() - begin; return diff < 10000000 && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills(); }
Assume the first request run into the CommitLog's method putMessage, the physical clock is 0s, the server's physical clock drift to 2s, then the second request run into the isOSPageCacheBusy in order to check if the system is busy, so the second request is rejected by the broker for the clock drift.
So should we replace System.currentTimeMillis() with System.nanoTime() to reduce the abnormal request rejection?