There are two core functions, report(#sendHeartbeat, #blockReport, #cacheReport) and #processCommand in #BPServiceActor main process flow. If processCommand cost long time it will block send report flow. Meanwhile processCommand could cost long time(over 1000s the worst case I meet) when IO load of DataNode is very high. Since some IO operations are under #datasetLock, So it has to wait to acquire #datasetLock long time when process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat will not send to NameNode in-time, and trigger other disasters.
I propose to improve #processCommand asynchronously and not block #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
1. Lifeline could be one effective solution, however some old branches are not support this feature.
2. IO operations under #datasetLock is another issue, I think we should solve it at another JIRA.