Now, the NN closes the file and the header sector that has the correct LAYOUT_VERSION is flushed to the disk, whereas some other pages of the file encountered an error while being flushed.
In the attached patch I actually close the file, then re-open it with RandomAccessFile to write the correct header. My thinking was that the close() would flush. That's probably not guaranteed, but with most journaled filesystems (e.g. ext3 with data=ordered) the data must be written before the metadata transaction commits to disk. Therefore, on an OS crash or power outage, the file will either not have committed its metadata, in which case it will appear empty or not exist at all, or it will have committed its metadata after flushing the complete data. Whether the rewritten, correct LAYOUT_VERSION has been synced is unknown, but I think the ordering is going to be correct.
That said, we should probably stick a FileChannel.force() in there after both writes to be extra safe. It may hurt performance a little bit, but I think this is one of those areas where we should err on the side of safety, yea?
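To make the sequence concrete, here is a rough sketch of the two-pass write with a force() after each write. It is not the code from the patch: the class name, the method names, the placeholder value, and the assumption that the layout version is a 4-byte int at offset 0 are all hypothetical and only there for illustration.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ImageHeaderSketch {
    // Hypothetical marker meaning "image not yet complete".
    static final int PLACEHOLDER_VERSION = 0;

    // Assumes the header is a single int at the start of the file.
    static void saveImage(File file, int layoutVersion, byte[] body) throws IOException {
        // Pass 1: write a placeholder header plus the image body,
        // then force data and metadata to the device before closing.
        try (FileOutputStream fos = new FileOutputStream(file);
             DataOutputStream out = new DataOutputStream(fos)) {
            out.writeInt(PLACEHOLDER_VERSION);
            out.write(body);
            out.flush();
            fos.getChannel().force(true);
        }

        // Pass 2: re-open with random access, overwrite the header with
        // the real layout version, and force again.
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.seek(0);
            raf.writeInt(layoutVersion);
            raf.getChannel().force(true);
        }
    }

    // Reads back the int stored at offset 0.
    static int readLayoutVersion(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            return raf.readInt();
        }
    }
}
```

With this ordering, a reader that finds the placeholder value at offset 0 could treat the image as incomplete. The two force(true) calls are where the performance cost mentioned above would come from.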
The other question is that the device that stores the FSImage now needs to be Seekable. Was this the case earlier too?
The current code simply uses a FileOutputStream; there's no abstraction going on. So it already assumes a normal filesystem with random access. For the edit logs we can't currently assume seek, but I think it's a fair assumption for the images.