[HDDS-8324] Remove chunk cache entry only for the lower index value used for apply transaction - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: Ozone Datanode
Labels:
- pull-request-available

Description

ContainerStateMachine in the DN has a stateMachineDataCache, which caches data keyed by the Ratis log index. The leader uses this cache to send data to the followers.

Issue: the cache is cleared with incorrect logic. When apply transaction is called with a lower index, all entries at higher indexes are removed.

if (division.getInfo().isLeader()) {
  long minIndex = Arrays.stream(division.getInfo()
      .getFollowerNextIndices()).min().getAsLong();
  LOG.debug("Removing data corresponding to log index {} min index {} "
          + "from cache", index, minIndex);
  stateMachineDataCache.removeIf(k -> k >= (Math.min(minIndex, index)));
}

Impact:

With this clearing, the leader has to read the data back from disk when it sends it to followers, which adds pressure on disk IO.

As a solution, the check should be k <= (Math.min(minIndex, index)), so that only entries at or below that index are cleared, since follower sync is already done for them.
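
The difference between the two checks can be seen in a small standalone sketch. This is not the ContainerStateMachine code: the class name, the helper methods, and the use of a plain TreeSet in place of stateMachineDataCache are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the cache: a sorted set of cached log indices.
class CacheEvictionSketch {

  // Original check: removes every key at or above min(minIndex, index).
  static List<Long> remainingWithOriginalCheck(
      List<Long> cached, long minIndex, long index) {
    TreeSet<Long> cache = new TreeSet<>(cached);
    long bound = Math.min(minIndex, index);
    cache.removeIf(k -> k >= bound);
    return new ArrayList<>(cache);
  }

  // Proposed check: removes every key at or below min(minIndex, index).
  static List<Long> remainingWithProposedCheck(
      List<Long> cached, long minIndex, long index) {
    TreeSet<Long> cache = new TreeSet<>(cached);
    long bound = Math.min(minIndex, index);
    cache.removeIf(k -> k <= bound);
    return new ArrayList<>(cache);
  }

  public static void main(String[] args) {
    List<Long> cached = List.of(1L, 2L, 3L, 4L, 5L, 6L);
    // Apply transaction at index 3 while the slowest follower is at 5.
    System.out.println(remainingWithOriginalCheck(cached, 5, 3)); // [1, 2]
    System.out.println(remainingWithProposedCheck(cached, 5, 3)); // [4, 5, 6]
  }
}
```

With the original check, indices 4 to 6, which the followers have not received yet, are dropped from the cache while the already applied 1 and 2 stay. With the proposed check, the pending entries stay cached.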

Impact with this change:

The cache is bounded by LeaderNumPendingRequests (write.element-limit, default 1024) and pendingRequestsBytesLimit (dfs.container.ratis.leader.pending.bytes.limit, default 1GB). Once a limit is reached, further additions to the cache block until the followers are in sync. This correctly throttles the write load on the DN until the cached entries are in sync with a majority of followers.
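
The blocking behaviour described above can be pictured with a rough sketch of a cache bounded by an element count and a byte count. This is an assumption-laden illustration, not the Ozone implementation: the class BoundedIndexCacheSketch, its methods, and the semaphore-based accounting are all hypothetical, and the limits in the usage note are tiny values chosen for the example rather than the defaults quoted above.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Semaphore;
import java.util.function.LongPredicate;

// Hypothetical cache keyed by log index, bounded by entries and bytes.
class BoundedIndexCacheSketch {
  private final ConcurrentSkipListMap<Long, byte[]> entries =
      new ConcurrentSkipListMap<>();
  private final Semaphore elementPermits;
  private final Semaphore bytePermits;

  BoundedIndexCacheSketch(int elementLimit, int byteLimit) {
    this.elementPermits = new Semaphore(elementLimit);
    this.bytePermits = new Semaphore(byteLimit);
  }

  // Blocking insert: waits until both limits have room for the entry.
  void put(long logIndex, byte[] data) throws InterruptedException {
    elementPermits.acquire();
    bytePermits.acquire(data.length);
    store(logIndex, data);
  }

  // Non-blocking variant: returns false instead of waiting when full.
  boolean tryPut(long logIndex, byte[] data) {
    if (!elementPermits.tryAcquire()) {
      return false;
    }
    if (!bytePermits.tryAcquire(data.length)) {
      elementPermits.release();
      return false;
    }
    store(logIndex, data);
    return true;
  }

  private void store(long logIndex, byte[] data) {
    byte[] old = entries.put(logIndex, data);
    if (old != null) {
      release(old); // a replaced entry gives its capacity back
    }
  }

  // Removes matching keys and frees their capacity for waiting writers.
  void removeIf(LongPredicate shouldRemove) {
    for (Long k : entries.keySet()) {
      if (shouldRemove.test(k)) {
        byte[] removed = entries.remove(k);
        if (removed != null) {
          release(removed);
        }
      }
    }
  }

  private void release(byte[] data) {
    elementPermits.release();
    bytePermits.release(data.length);
  }

  int size() {
    return entries.size();
  }
}
```

For example, with an element limit of 2, a third tryPut returns false until removeIf(k -> k <= 1) frees the entry at index 1. The blocking put would wait at that point instead, which is the throttling effect described above.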

Issue Links

causes

HDDS-8299 Disk full situation on a leader DN may result in followers getting stuck in a retry loop

Resolved

links to

GitHub Pull Request #4499

People

Assignee: Sumit Agrawal

Reporter: Sumit Agrawal

Dates

Created: 30/Mar/23 06:42

Updated: 23/Aug/23 17:32

Resolved: 03/Apr/23 16:01