We recently discovered an issue in which, during a rolling upgrade, some jobs were failing with exceptions like the following (sadly, this is the whole stack trace):
with an earlier statement in the log like:
Strangely, we did not see any other logs about the DFSOutputStream failing after waiting for the DataNode restart. We eventually realized that in some cases DFSOutputStream#close() may be called more than once, and that when it is, the IOException above is thrown on the second call to close(). (This is even with HDFS-5335 in place; before that fix, it would have been thrown on every call to close() after the first.)
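The failure mode can be reproduced in miniature. The sketch below is not the actual HDFS source; the class and field names are illustrative stand-ins for a stream that records a streamer error and re-checks it on every close() call, as described above.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch (not the real DFSOutputStream): the stream holds on
// to the last streamer error and re-checks it whenever close() is called.
class SketchOutputStream implements Closeable {
    private boolean closed = false;
    // Stand-in for DataStreamer's lastException field.
    private final AtomicReference<IOException> lastException = new AtomicReference<>();

    void recordError(IOException e) { lastException.set(e); }

    boolean isClosed() { return closed; }

    @Override
    public void close() throws IOException {
        if (isClosed()) {
            // Second call: the stale exception, never cleared after the
            // DataNode came back, is thrown even though the first close()
            // already succeeded.
            IOException e = lastException.get();
            if (e != null) {
                throw e;
            }
            return;
        }
        // First call: the stream closes successfully, but the stale error
        // state is never reset.
        closed = true;
    }
}

public class DoubleCloseDemo {
    public static void main(String[] args) throws IOException {
        SketchOutputStream out = new SketchOutputStream();
        out.recordError(new IOException("Datanode is restarting"));
        out.close();       // first close succeeds quietly
        try {
            out.close();   // second close throws the stale exception
        } catch (IOException e) {
            System.out.println("second close threw: " + e.getMessage());
        }
    }
}
```

Note that this violates the java.io.Closeable contract, which states that closing an already-closed stream should have no effect.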
The problem lies in DataStreamer#createBlockOutputStream(): after the new output stream is created, the method resets the error states:
But it never clears lastException. When DFSOutputStream#closeImpl() is called a second time, this block is triggered:
On the second call, isClosed() returns true, so the exception check runs and the stale "Datanode is restarting" exception is thrown even though the stream was already closed successfully.
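The shape of the fix this implies can be sketched as follows. Again, the names are hypothetical and do not match the real DataStreamer fields exactly; the point is that the stored exception must be nulled out wherever the other error flags are reset, so that a completed close() stays idempotent.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the repaired reset path; field names are
// illustrative, not the actual DataStreamer members.
class StreamerSketch {
    boolean hasError = false;
    int errorIndex = -1;
    final AtomicReference<IOException> lastException = new AtomicReference<>();
    private boolean closed = false;

    // Called after a new block output stream is created during pipeline
    // recovery, mirroring the reset in createBlockOutputStream().
    void resetErrorState() {
        hasError = false;
        errorIndex = -1;
        lastException.set(null);  // the missing reset: without this line, a
                                  // stale "Datanode is restarting" error
                                  // survives into the next close() call
    }

    // Mirrors the closeImpl() flow described above: once closed, only the
    // recorded exception (if any) is re-checked.
    void closeImpl() throws IOException {
        if (closed) {
            IOException e = lastException.get();
            if (e != null) {
                throw e;
            }
            return;
        }
        closed = true;
    }
}

public class FixDemo {
    public static void main(String[] args) throws IOException {
        StreamerSketch s = new StreamerSketch();
        s.lastException.set(new IOException("Datanode is restarting"));
        s.resetErrorState();  // recovery succeeded; the fix clears lastException
        s.closeImpl();        // first close
        s.closeImpl();        // second close is now a harmless no-op
        System.out.println("double close ok");
    }
}
```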