Uploaded image for project: 'Apache RocketMQ'
  1. Apache RocketMQ
  2. ROCKETMQ-384

broker flow control is abnormal when the machine's physical time drift

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 4.2.0
    • None
    • rocketmq-broker
    • None
    • operating system: CentOS6.0
      hardware: 8C8G VM
      version: rocketmq-4.2.0
      broker config: default 2m-2s-async config
      producer qps: 4000
      pre message byte: 10 bytes

    Description

          When I did a performance test for broker,I found large numbers of exception like this:
      send error com.alibaba.rocketmq.client.exception.MQBrokerException: CODE: 2 DESC: [REJECTREQUEST]system busy, start flow control for a while
          After I have read the related source code,I think this exception occurs when the broker think its OS is busy, it will refuse the increasing request. But I saw the server's monitor, the JVM gc is normal(normally 10ms), max cost 50ms and performed at a frequency of 10s once, and my server's CPU and disk IO also is health.But my server's NTP offset occasionally drift more than 2s.So I think the pyhsical time caused the flow control.
      related code:

      CommitLog.java
      public PutMessageResult putMessage(final MessageExtBrokerInner msg) {
          ... ...
          long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now(); //now() is System.currentTimeMillis()
          this.beginTimeInLock = beginLockTimestamp;
          ... ...
      }
      DefaultMessageStore.java
      public boolean isOSPageCacheBusy() {
          long begin = this.getCommitLog().getBeginTimeInLock();
          long diff = this.systemClock.now() - begin;
      
          return diff < 10000000
            && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills();
      }
      

          Assume the first request run into the CommitLog's method putMessage, the physical clock is 0s, the server's physical clock drift to 2s, then the second request run into the isOSPageCacheBusy in order to check if the system is busy, so the second request is rejected by the broker for the clock drift.
          So should we replace System.currentTimeMillis() with System.nanoTime() to reduce the abnormal request rejection?

      Attachments

        Activity

          People

            yukon Xinyu Zhou
            liyuzhou liyuzhou
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: