Thanks Wangda Tan for the review and comments!
Even after the R/W lock changes, when anything bad happens on the disks, DirectoryCollection will be stuck under write locks, so NodeStatusUpdater will be blocked as well.
Not really. From the jstack above, you can see that the operation pending on busy IO happens in the line below, which is now outside any lock:
Map<String, DiskErrorInformation> dirsFailedCheck = testDirs(allLocalDirs,
So NodeStatusUpdater won't get blocked while testDirs is pending on an mkdir operation.
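To illustrate the pattern, here is a minimal sketch (class and method names DirChecker/probe are hypothetical, not the actual DirectoryCollection code): the slow disk probe runs with no lock held, and only the quick in-memory state update takes the write lock, so readers like NodeStatusUpdater never wait on busy IO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the checkDirs pattern discussed above: the possibly
// blocking IO test runs outside any lock; the lock only guards the fast
// in-memory swap of results.
class DirChecker {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> goodDirs = new ArrayList<>();

  void checkDirs(List<String> allDirs) {
    // Slow, possibly blocking IO happens with no lock held.
    List<String> passed = new ArrayList<>();
    for (String d : allDirs) {
      if (probe(d)) {   // e.g. mkdir/read/write test; may stall on a bad disk
        passed.add(d);
      }
    }
    // Only the quick in-memory update takes the write lock.
    lock.writeLock().lock();
    try {
      goodDirs = passed;
    } finally {
      lock.writeLock().unlock();
    }
  }

  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(goodDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Placeholder probe; the real check does mkdir and disk-space validation.
  private boolean probe(String dir) {
    return !dir.startsWith("bad");
  }
}
```

Even if probe() hangs for minutes on a failing disk, a concurrent getGoodDirs() call still returns immediately with the last known state.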
1) In the short term, errorDirs/fullDirs/localDirs are copy-on-write lists, so we don't need to acquire a lock in getGoodDirs/getFailedDirs/getFullDirs. This could lead to inconsistent data in rare cases, but I think in general this is safe, and inconsistent data will be corrected in the next heartbeat.
In general, a read/write lock is more flexible and more consistent, since we have several resources under race conditions. A copy-on-write list can only guarantee that no ConcurrentModificationException happens between a read and a write on the same list; it cannot provide consistent semantics across lists. Thus, I would prefer to keep the read/write lock here, and CopyOnWriteArrayList can be replaced with a plain ArrayList. What do you think?
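A minimal sketch of the cross-list consistency point (DirState and markFailed are hypothetical names, not the actual DirectoryCollection API): moving a dir between two related lists under one write lock keeps the pair consistent for every reader, which separate CopyOnWriteArrayLists cannot guarantee because a reader could observe one list updated and the other not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: one read/write lock guards two related lists, so a
// multi-list update (remove from goodDirs + add to failedDirs) is atomic with
// respect to readers. Invariant: every dir is in exactly one list.
class DirState {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> goodDirs = new ArrayList<>();
  private final List<String> failedDirs = new ArrayList<>();

  DirState(List<String> dirs) {
    goodDirs.addAll(dirs);
  }

  // Move a dir from good to failed as one atomic step under the write lock.
  void markFailed(String dir) {
    lock.writeLock().lock();
    try {
      if (goodDirs.remove(dir)) {
        failedDirs.add(dir);
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Readers observe both lists at a single consistent point in time, so the
  // total count never fluctuates mid-update.
  int totalDirs() {
    lock.readLock().lock();
    try {
      return goodDirs.size() + failedDirs.size();
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

With two independent CopyOnWriteArrayLists instead, totalDirs() could transiently report one dir too few or too many while a writer is between the two list operations.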
2) In the longer term, we may need to consider a DirectoryCollection stuck on busy IO an unhealthy state; NodeStatusUpdater should be able to report such status to the RM, so the RM will avoid allocating any new containers to such nodes.
I agree we should provide better IO control on each node of a YARN cluster. We can report an unhealthy status when IO gets stuck, or even better, count IO load as a resource for better/smarter scheduling. However, how to react to the very-busy-IO case is a different topic from the problem this JIRA tries to resolve. In any case, the NM heartbeat is not supposed to be cut off unless the daemon crashes.