[HDDS-2542] Race condition between read and write stateMachineData - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.5.0
Component/s: Ozone Datanode
Labels: pull-request-available
Target Version/s: 0.5.0

Description

The write payload (the chunk itself) is sent to Ratis as external binary data (a byte array). It is not part of the LogEntry; it is saved from an async thread by calling ContainerStateMachine.writeStateMachineData.

Because the write runs on an async thread, the stateMachineData may not yet be written when the data has to be sent to the followers in the next heartbeat.

By design, a cache is used to avoid this issue, but there are multiple problems with the cache.

First, the current cache size is chunkExecutor.getCorePoolSize(), which is not enough. By default this means 60 executor threads and a cache of size 60. But with one very slow writer and 59 very fast writers, cache entries can be invalidated before the corresponding write has finished.

In my tests (freon datanode-chunk-writer-generator) I have seen cache misses even with a cache size of 5000.
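
The eviction problem can be reproduced in isolation. The sketch below is not the Ozone code: it uses a plain LinkedHashMap as a hypothetical stand-in for the cache and assumes oldest-first eviction once the capacity is exceeded. The class and method names are made up for illustration. It shows that an entry belonging to a slow write disappears as soon as enough later entries have been added, regardless of whether the slow write has finished.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedCacheEviction {
    // Hypothetical stand-in for the stateMachineData cache (assumption):
    // insertion-ordered map that drops its oldest entry once capacity is exceeded.
    static <K, V> Map<K, V> boundedCache(final int capacity) {
        return new LinkedHashMap<K, V>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Caches the entry of a slow write at log index 0, then adds `fastWrites`
    // later entries. Returns true if the slow entry is still in the cache.
    static boolean slowEntrySurvives(int capacity, int fastWrites) {
        Map<Long, byte[]> cache = boundedCache(capacity);
        cache.put(0L, new byte[] {1});
        for (long index = 1; index <= fastWrites; index++) {
            cache.put(index, new byte[] {1});
        }
        return cache.containsKey(0L);
    }

    public static void main(String[] args) {
        System.out.println(slowEntrySurvives(60, 59)); // true: cache is exactly full
        System.out.println(slowEntrySurvives(60, 60)); // false: slow entry was evicted
    }
}
```

Under these assumptions, any fixed capacity only moves the threshold: a sufficiently slow writer can always be overtaken by more fast writes than the cache can hold.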

Second, readStateMachineData and writeStateMachineData are called from two different threads, so there is a race condition independent of the cache size. It's possible that the write thread has not yet added the data to the cache when the read thread needs it.
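
One possible way to close this kind of race, sketched here under assumptions and not necessarily what the linked pull request does, is to register a future for the log index on the submitting thread before the async write is scheduled. A reader then always finds an entry and waits on it, instead of racing with the writer. PendingWriteRegistry and its methods are hypothetical names, and the sleep only simulates a slow disk write.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PendingWriteRegistry {
    // Log index -> future that completes when the data has been written.
    // A real implementation would also remove entries once they are no longer needed.
    private final ConcurrentHashMap<Long, CompletableFuture<byte[]>> pending =
        new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    // Runs on the submitting thread: the future is registered BEFORE the
    // async write is handed to the executor, so a later read cannot miss it.
    CompletableFuture<byte[]> write(long logIndex, byte[] data, long delayMillis) {
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        pending.put(logIndex, future);
        writer.execute(() -> {
            try {
                Thread.sleep(delayMillis); // simulated slow write (assumption)
                future.complete(data);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                future.completeExceptionally(e);
            }
        });
        return future;
    }

    // Runs on the reading thread: blocks until the write for this index is done.
    byte[] read(long logIndex) throws Exception {
        CompletableFuture<byte[]> future = pending.get(logIndex);
        if (future == null) {
            throw new IllegalStateException("no write registered for index " + logIndex);
        }
        return future.get(5, TimeUnit.SECONDS);
    }

    void shutdown() {
        writer.shutdown();
    }

    // Small end-to-end run: the read is issued while the write is still in flight.
    static int demo() {
        PendingWriteRegistry registry = new PendingWriteRegistry();
        try {
            registry.write(1L, new byte[] {42}, 100);
            return registry.read(1L)[0];
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            registry.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```

The design point is ordering: the lookup key becomes visible before the asynchronous work starts, so the reader's only possible states are "pending" and "done", never "missing".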

Issue Links

is required by

HDDS-2701 Avoid read from temporary chunk file in datanode

Resolved

links to

GitHub Pull Request #310

People

Assignee: Lokesh Jain

Reporter: Marton Elek

Votes: 0

Watchers: 7

Dates

Created: 19/Nov/19 12:53

Updated: 09/Jan/20 14:25

Resolved: 09/Jan/20 14:25

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m