Details
Description
closing the edits log file can race with write to edits log file operation resulting in OP_INVALID end-of-file marker being initially overwritten by the concurrent (in setReadyToFlush) threads and then removed twice from the buffer, losing a good byte from edits log.
Example:
FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() OR FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush() FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() VERSUS FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush() FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() OR Any FSEditLog.write
Access on the edits flush operations is synchronized only in the FSEdits.logSync() method level. However at a lower level access to EditsLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT synchronized. These can be called from concurrent threads like in the example above
So if a rollEditLog or rollFSIMage is happening at the same time with a write operation it can race for EditLogFileOutputStream.setReadyToFlush that will overwrite the the last byte (normally the FSEditsLog.OP_INVALID which is the "end-of-file marker") and then remove it twice (from each thread) in flushAndSync()! Hence there will be a valid byte missing from the edits log that leads to a SecondaryNameNode silent failure and a full HDFS failure upon cluster restart.
We got to this point after investigating a corrupted edits file that made HDFS unable to start with
java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768. at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
EDIT: moved the logs to a comment to make this readable
Attachments
Attachments
Issue Links
- is blocked by
-
HADOOP-6554 DelegationTokenSecretManager lifecycle is inconsistent
- Resolved
- is related to
-
HDFS-980 Convert FSNamesystem lock to ReadWriteLock
- Resolved