The Linux kernel doesn't allow you to set a timeout on I/O operations, unless you use O_DIRECT and async I/O. If operations with the local filesystem take longer than you would like, what are you going to have the NameNode do? Kill itself? It can't even kill itself if it is hung on a write, because the process will be in "D state," otherwise known as uninterruptible sleep.
The thread that is calling fsync() will be in D state; that does not mean all other threads are stuck. We can have a separate thread perform the fsync() while the thread doing logSync() observes it and times out (see the sketch below). The stuck thread can later determine, if it eventually returns, whether it should terminate. The failed EditLogOutputStream will be aborted, and EditLogFileOutputStream must do this in a non-blocking way. Any subsequent logSync() will avoid the troublesome edits directory.
If the failed directory is "required", the NN will try to exit; the exit itself may hang, but the failure will likely be noticed right away. This still gives shorter failure-detection latency, plus the possibility of keeping the service up when the failed edits directory was not required and the NN has no other dependency on it.
This is sort of like a simplified, user-space-only AIO. The main difference is that there can be only one outstanding I/O per file descriptor, which doesn't limit our use case.
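Roughly like the sketch below. This is illustration only, not the actual EditLogFileOutputStream change; names like TimedFsyncStream and FSYNC_TIMEOUT_MS are made up. A single worker thread per stream naturally enforces the one-outstanding-I/O-per-descriptor constraint, and the logSync() caller waits on the Future under a timeout:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicBoolean;

    class TimedFsyncStream {
        private static final long FSYNC_TIMEOUT_MS = 30_000;  // hypothetical value
        // One worker thread per stream => at most one outstanding I/O per descriptor.
        private final ExecutorService syncWorker = Executors.newSingleThreadExecutor();
        private final AtomicBoolean aborted = new AtomicBoolean(false);
        private final FileOutputStream out;

        TimedFsyncStream(FileOutputStream out) {
            this.out = out;
        }

        // Called from logSync(); fails fast instead of blocking the caller forever.
        void timedFsync() throws IOException {
            Future<?> f = syncWorker.submit(() -> {
                out.getFD().sync();    // the call that can hang in D state
                if (aborted.get()) {
                    out.close();       // we returned after the caller gave up; clean up
                }
                return null;
            });
            try {
                f.get(FSYNC_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            } catch (TimeoutException te) {
                aborted.set(true);     // mark the stream failed without blocking
                throw new IOException("fsync() timed out; aborting this edits stream", te);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for fsync()", ie);
            } catch (ExecutionException ee) {
                throw new IOException("fsync() failed", ee.getCause());
            }
        }
    }

Once timedFsync() throws, the caller removes the stream from the set of active journals, so subsequent logSync() calls never touch the troublesome directory again; the aborted flag lets the late-returning worker clean up on its own.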
In this scenario, the NameNode worker thread will be blocked forever, probably while holding the FSImage lock.
In almost all cases I have seen, it was logSync() that first hit an I/O error condition, probably because of its call frequency. You may see what you mentioned if the average fsync() interval is not much higher than the checkpoint interval; that hasn't been the case in our production environment.
There is nothing you can do. You can't kill the thread, and even if you could, how would you get the mutex back? There is nothing Java can do when the OS decides your thread cannot run.
You can't kill it, but you can certainly do other things; otherwise I would call that MT implementation broken. You can design things to work around this condition.
The solution to your problem is that you can easily set a timeout on NFS operations by using a soft mount plus timeo=60 (or whatever timeout you want).
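For example (illustrative only; the server path is made up; note that per nfs(5), timeo is measured in tenths of a second, and on a soft mount the client returns an error to the application once retrans retries are exhausted):

    # soft mount: stuck RPCs eventually error out instead of hanging forever;
    # timeo=600 with retrans=2 gives up after roughly a minute
    mount -t nfs -o soft,timeo=600,retrans=2 nfsserver:/export/edits /mnt/edits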
That only works for NFS.
I fully acknowledge that there are reasonable configurations that do not require this feature to do fast failure detection. But please understand that while such configurations may eventually apply to all use cases, they do not today.