  1. Ratis
  2. RATIS-1411

Alleviate slow follower issue

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: server
    • Labels: None

    Description

      We observed a slow-follower problem in our stress test. For example, while intensively writing 1 TB of data, the next_index of the leader and of one follower is above 1,000,000 (100w+), while the next_index of the slow follower is only above 500,000 (50w+). This large gap causes many WatchForCommit timeout exceptions.

      After rerunning the test and investigating, the Ozone stateMachineDataCache turned out to be the key factor. With stateMachineDataCache set to 1024 or more, write request data is removed from stateMachineDataCache as soon as a majority (the leader and one follower) has committed the write request index. When the grpcLogAppender of the second follower later wants to send that write request, the leader has to fetch the chunk data from the on-disk chunk file.

      Reading from the chunk file is much more expensive than reading from the cache. Once a follower can no longer get its data from stateMachineDataCache, it never catches up until the write finishes.
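The eviction behaviour described above can be illustrated with a minimal sketch. It is not the Ozone or Ratis implementation: the class `MajorityCommitCacheSketch` and all of its methods are hypothetical names, and the disk read is replaced by a counter so that the fallback path is visible.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; none of these names are Ozone or Ratis APIs. */
class MajorityCommitCacheSketch {
    // Log index -> write payload kept in memory.
    private final ConcurrentSkipListMap<Long, byte[]> cache = new ConcurrentSkipListMap<>();
    // Counts how often a log appender had to fall back to the chunk file.
    private final AtomicInteger diskReads = new AtomicInteger();

    void put(long index, byte[] data) {
        cache.put(index, data);
    }

    /** Highest index that a majority of the given peers has reached. */
    static long majorityCommitIndex(long... matchIndices) {
        long[] sorted = matchIndices.clone();
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    /** Drops every entry up to and including the committed index. */
    void onCommit(long committedIndex) {
        cache.headMap(committedIndex, true).clear();
    }

    /** Returns the payload for a follower, falling back to "disk" on a miss. */
    byte[] readForFollower(long index) {
        byte[] data = cache.get(index);
        if (data == null) {
            diskReads.incrementAndGet();
            data = new byte[0]; // stand-in for an expensive chunk-file read
        }
        return data;
    }

    int diskReads() {
        return diskReads.get();
    }
}
```

With match indices of 10, 10 and 4, the majority commit index is 10, so entries 1 through 10 are dropped and every request from the follower at index 4 becomes a simulated disk read.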

      I tried replacing the ResourceLimitCache (stateMachineDataCache) with a Guava Cache. It made no obvious difference, since the cache size is still limited. As soon as the entry at the follower's next_index has been evicted from the cache, the follower becomes slower and slower.

      Then I tried replacing the LinkedBlockingDeque in chunkExecutors with a PriorityBlockingList, so that readStatemachine tasks are placed ahead of other blocks' write tasks and tasks execute in entryIndex order. Although readStatemachine tasks get priority, there are so many of them that the overall effect is smaller than expected.
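The prioritisation idea can be sketched as follows. This is an assumption-laden illustration, not the code that was tested: it uses the JDK's `java.util.concurrent.PriorityBlockingQueue` as the work queue, and `PriorityChunkExecutorSketch` and `ChunkTask` are hypothetical names.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not the Ozone chunkExecutors code. */
class PriorityChunkExecutorSketch {

    /** A task that sorts reads before writes, then by ascending entry index. */
    static final class ChunkTask implements Runnable, Comparable<ChunkTask> {
        final boolean isRead;
        final long entryIndex;
        private final Runnable body;

        ChunkTask(boolean isRead, long entryIndex, Runnable body) {
            this.isRead = isRead;
            this.entryIndex = entryIndex;
            this.body = body;
        }

        @Override
        public int compareTo(ChunkTask other) {
            if (isRead != other.isRead) {
                return isRead ? -1 : 1; // reads come first
            }
            return Long.compare(entryIndex, other.entryIndex);
        }

        @Override
        public void run() {
            body.run();
        }
    }

    /**
     * Single-threaded executor whose backlog is ordered by ChunkTask priority.
     * Tasks must be passed to execute(), not submit(): submit() wraps them in
     * a FutureTask, which is not Comparable and would fail in the queue.
     */
    static ThreadPoolExecutor newExecutor() {
        return new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<Runnable>());
    }
}
```

Ordering only applies to tasks that are waiting in the queue. If most of the backlog consists of read tasks, as the description reports, reordering cannot reduce the total amount of disk work, which is consistent with the limited gain observed.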

      So the key to resolving the slow follower is to make sure that all of its data stays in the cache as long as possible.

      My solution is to set a threshold on the gap between the majority committed index and the slow follower's committed index, which guarantees that the data is still in the cache. I used 0.75 as the ratio in my test. The effect is very good: I wrote 2 TB of data to a 3-DN cluster, each DN with 10 HDDs, and the task finished in 40 minutes without any watchForCommit timeout.
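The description does not say how the 0.75 ratio is applied, so the sketch below shows only one possible reading: the allowed gap is the ratio multiplied by the cache capacity, and new writes are held back while the slowest follower lags by more than that. The class `FollowerGapGateSketch` and its methods are hypothetical and are not taken from the Ratis patch.

```java
/** Hypothetical illustration of a gap threshold; not the actual Ratis change. */
class FollowerGapGateSketch {
    private final long cacheCapacity; // number of entries the cache can hold
    private final double gapRatio;    // fraction of the cache the gap may use, e.g. 0.75

    FollowerGapGateSketch(long cacheCapacity, double gapRatio) {
        if (cacheCapacity <= 0 || gapRatio <= 0 || gapRatio > 1) {
            throw new IllegalArgumentException("capacity must be > 0 and ratio in (0, 1]");
        }
        this.cacheCapacity = cacheCapacity;
        this.gapRatio = gapRatio;
    }

    /** Largest index gap that still keeps the slow follower's data cached. */
    long maxAllowedGap() {
        return (long) (cacheCapacity * gapRatio);
    }

    /** True when new writes should wait for the slowest follower to catch up. */
    boolean shouldThrottle(long majorityCommitIndex, long slowestFollowerIndex) {
        return majorityCommitIndex - slowestFollowerIndex > maxAllowedGap();
    }
}
```

With a capacity of 1024 entries and a ratio of 0.75, the allowed gap is 768 entries: a follower 700 entries behind does not delay writes, while one 800 entries behind does. Under this reading the leader gives up some write throughput so that the entries the slow follower still needs remain in the cache.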

            People

              Assignee: Sammi Chen
              Reporter: Sammi Chen
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h