In our cluster, an application hung while doing a short-circuit read of a local HDFS block. Looking into the logs, we found that the DataNode's DomainSocketWatcher.watcherThread had exited with the following log:
Line 463 is the following code snippet:
getAndClearReadableFds is a native method that mallocs an int array. Since memory on our nodes is very tight, it looks like the malloc failed and a NULL pointer was returned.
The bad thing is that other threads then blocked with a stack like this:
IMO, we should exit the DN so that users know something went wrong and can fix it.
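To make the proposal concrete, here is a minimal sketch of the fail-fast idea in plain Java. It is not the actual Hadoop code or a proposed patch: `WatcherThreadGuard`, `startGuarded`, and the `terminator` callback are hypothetical names used only for illustration. The sketch installs an uncaught-exception handler on a background thread so that an unexpected death (such as a NullPointerException from a null array) triggers a termination hook instead of leaving other threads blocked forever.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class WatcherThreadGuard {
    private WatcherThreadGuard() {}

    /**
     * Starts a daemon thread running the given body. If the body dies with an
     * uncaught exception, the terminator is called with a non-zero status.
     * A real process would pass something like System::exit here (assumption).
     */
    static Thread startGuarded(String name, Runnable body, IntConsumer terminator) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        t.setUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Thread " + thread.getName()
                + " terminated unexpectedly: " + error);
            terminator.accept(1);
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = startGuarded("watcher-demo", () -> {
            // Stands in for a native call that returned NULL after a failed malloc.
            int[] fds = null;
            System.out.println(fds.length); // throws NullPointerException
        }, status -> System.err.println("would exit with status " + status));
        t.join();
    }
}
```

The handler runs on the dying thread before it terminates, so the process can shut down visibly rather than hang. Whether the DN should exit outright or try to restart the watcher is a separate design question; this sketch only shows the exit path.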