Details of the NPE:
The JVM did produce a stack trace on the very first occurrence of the NPE. Subsequent ones were missing a stack trace.
The NPE is caused by the new targets passed to commitBlockSynchronization() containing a dead node. Block recoveries are issued based on BlockUnderConstructionFeature.replicas (aka expected locations), which is not updated on node death, so a block recovery can include dead nodes. When commitBlockSynchronization() is called, the expected locations are also updated. (In fact, the whole BlockUnderConstructionFeature is swapped.) Each expected location is populated by looking up the datanode storage using the storage ID string passed to commitBlockSynchronization(). If the node is dead, the lookup returns null.
(Clarification on dead node: the faulty node did try to come back at times, and that actually made the situation worse. On re-registration, the existing storages are removed from the datanode descriptor. If the node then cannot heartbeat for some reason, a storage lookup using a storage ID will return null.)
If getBlockLocation() is called after this, newLocatedBlock() is called with the expected locations rather than the locations in the blocks map, since the block is still under construction. This calls DatanodeStorageInfo.toDatanodeInfos(), which blows up when it tries to call getDatanodeDescriptor() on the null storage object.
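The failure sequence above can be reproduced in miniature. This is only a rough sketch, not the HDFS code: NodeSketch, StorageSketch, NpeSketch, reRegister() and locationsThrowNpe() are hypothetical stand-ins, and only the method names getStorageInfo(), getDatanodeDescriptor() and toDatanodeInfos() are chosen to echo the ones discussed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for DatanodeStorageInfo; only the link to its owner is modeled.
class StorageSketch {
    private final NodeSketch owner;
    StorageSketch(NodeSketch owner) { this.owner = owner; }
    NodeSketch getDatanodeDescriptor() { return owner; }

    // Same shape as the conversion described above: every element is dereferenced.
    static NodeSketch[] toDatanodeInfos(StorageSketch[] storages) {
        NodeSketch[] nodes = new NodeSketch[storages.length];
        for (int i = 0; i < storages.length; i++) {
            nodes[i] = storages[i].getDatanodeDescriptor(); // NPE when storages[i] is null
        }
        return nodes;
    }
}

// Hypothetical stand-in for a datanode descriptor and its storage map.
class NodeSketch {
    private final Map<String, StorageSketch> storages = new HashMap<>();
    void addStorage(String id) { storages.put(id, new StorageSketch(this)); }
    // Assumption: re-registration drops storages until a later heartbeat re-adds them.
    void reRegister() { storages.clear(); }
    StorageSketch getStorageInfo(String id) { return storages.get(id); }
}

class NpeSketch {
    // Builds expected locations by storage ID, then converts them as newLocatedBlock() would.
    static boolean locationsThrowNpe(NodeSketch[] nodes, String[] storageIds) {
        StorageSketch[] expected = new StorageSketch[storageIds.length];
        for (int i = 0; i < storageIds.length; i++) {
            expected[i] = nodes[i].getStorageInfo(storageIds[i]); // null for a dead storage
        }
        try {
            StorageSketch.toDatanodeInfos(expected);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        NodeSketch healthy = new NodeSketch();
        healthy.addStorage("DS-1");
        NodeSketch faulty = new NodeSketch();
        faulty.addStorage("DS-2");
        NodeSketch[] nodes = {healthy, faulty};
        String[] ids = {"DS-1", "DS-2"};

        System.out.println(locationsThrowNpe(nodes, ids)); // false: both storages resolve
        faulty.reRegister();                               // storages removed, no heartbeat yet
        System.out.println(locationsThrowNpe(nodes, ids)); // true: null entry is dereferenced
    }
}
```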
Proposed solution to the NPE issue:
We can have commitBlockSynchronization() check for a valid storage ID before updating data structures. Even if no valid storage ID is found, we can't fail the operation: one or more nodes did finalize the block, whether they are dead or alive at this moment. It is like a missing block case. We can go ahead and commit the block without the dead node/storage and also allow closing of the file, just like completeFile().
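A rough sketch of the proposed check, under simplifying assumptions: the live storages are modeled as a plain map from storage ID to datanode name, and CommitSketch and resolveLiveTargets() are hypothetical names, not part of HDFS. The point is only that unresolvable storage IDs are skipped rather than stored as null, and that an empty result is still a valid outcome rather than a failure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CommitSketch {
    // liveStorages: hypothetical lookup table (storage ID -> datanode name).
    // Storages of dead or re-registering nodes are simply absent from it.
    static List<String> resolveLiveTargets(Map<String, String> liveStorages,
                                           String[] newTargetStorageIds) {
        List<String> resolved = new ArrayList<>();
        for (String storageId : newTargetStorageIds) {
            String node = liveStorages.get(storageId);
            if (node == null) {
                continue; // dead storage: leave it out instead of recording a null location
            }
            resolved.add(node);
        }
        // Deliberately no exception when the list is empty: the block was finalized
        // somewhere, so the commit proceeds and the block is treated like a missing block.
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> live = new HashMap<>();
        live.put("DS-1", "dn1");

        // DS-2 belongs to the dead node and is dropped from the expected locations.
        System.out.println(resolveLiveTargets(live, new String[] {"DS-1", "DS-2"})); // [dn1]

        // No valid storage at all: still succeeds, with no locations recorded.
        System.out.println(resolveLiveTargets(live, new String[] {"DS-2"})); // []
    }
}
```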
On closing of the file, checkReplication() is called, and in our example this will cause the last block (still in committed state) to be reported as missing. If the dead node comes back, it will include the finalized replica in its block report, which will cause the block to be completed and the missing block to be cleared.