[ROCKETMQ-384] broker flow control is abnormal when the machine's physical time drift - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.2.0
Fix Version/s: None
Component/s: rocketmq-broker
Labels:
None
Environment:
operating system: CentOS6.0
hardware: 8C8G VM
version: rocketmq-4.2.0
broker config: default 2m-2s-async config
producer qps: 4000
pre message byte: 10 bytes

Description

When I did a performance test for broker，I found large numbers of exception like this:
send error com.alibaba.rocketmq.client.exception.MQBrokerException: CODE: 2 DESC: [REJECTREQUEST]system busy, start flow control for a while
After I have read the related source code,I think this exception occurs when the broker think its OS is busy, it will refuse the increasing request. But I saw the server's monitor, the JVM gc is normal(normally 10ms), max cost 50ms and performed at a frequency of 10s once, and my server's CPU and disk IO also is health.But my server's NTP offset occasionally drift more than 2s.So I think the pyhsical time caused the flow control.
related code:

CommitLog.java
public PutMessageResult putMessage(final MessageExtBrokerInner msg) {
    ... ...
    long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now(); //now() is System.currentTimeMillis()
    this.beginTimeInLock = beginLockTimestamp;
    ... ...
}
DefaultMessageStore.java
public boolean isOSPageCacheBusy() {
    long begin = this.getCommitLog().getBeginTimeInLock();
    long diff = this.systemClock.now() - begin;

    return diff < 10000000
      && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills();
}

Assume the first request run into the CommitLog's method putMessage, the physical clock is 0s, the server's physical clock drift to 2s, then the second request run into the isOSPageCacheBusy in order to check if the system is busy, so the second request is rejected by the broker for the clock drift.
So should we replace System.currentTimeMillis() with System.nanoTime() to reduce the abnormal request rejection?

Attachments

Activity

People

Assignee:: Xinyu Zhou

Reporter:: liyuzhou

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 11/Mar/18 15:20

Updated:: 12/Mar/18 06:19