Details
Description
2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
java.io.IOException: File is not under construction: ......
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
<RECORD>
<OPCODE>OP_REASSIGN_LEASE</OPCODE>
<DATA>
<TXID>32625021150</TXID>
<LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
<PATH>......</PATH>
<NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
</DATA>
</RECORD>
......
<RECORD>
<OPCODE>OP_CLOSE</OPCODE>
<DATA>
<TXID>32625023743</TXID>
<LENGTH>0</LENGTH>
<INODEID>0</INODEID>
<PATH>......</PATH>
<REPLICATION>3</REPLICATION>
<MTIME>1581816135883</MTIME>
<ATIME>1581814760398</ATIME>
<BLOCKSIZE>536870912</BLOCKSIZE>
<CLIENT_NAME></CLIENT_NAME>
<CLIENT_MACHINE></CLIENT_MACHINE>
<OVERWRITE>false</OVERWRITE>
<BLOCK>
<BLOCK_ID>5568434562</BLOCK_ID>
<NUM_BYTES>185818644</NUM_BYTES>
<GENSTAMP>4495417845</GENSTAMP>
</BLOCK>
<PERMISSION_STATUS>
<USERNAME>da_music</USERNAME>
<GROUPNAME>hdfs</GROUPNAME>
<MODE>416</MODE>
</PERMISSION_STATUS>
</DATA>
</RECORD>
......
<RECORD>
<OPCODE>OP_TRUNCATE</OPCODE>
<DATA>
<TXID>32625024049</TXID>
<SRC>......</SRC>
<CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
<CLIENTMACHINE>......</CLIENTMACHINE>
<NEWLENGTH>185818644</NEWLENGTH>
<TIMESTAMP>1581816136336</TIMESTAMP>
<BLOCK>
<BLOCK_ID>5568434562</BLOCK_ID>
<NUM_BYTES>185818648</NUM_BYTES>
<GENSTAMP>4495417845</GENSTAMP>
</BLOCK>
</DATA>
</RECORD>
......
<RECORD>
<OPCODE>OP_CLOSE</OPCODE>
<DATA>
<TXID>32625024993</TXID>
<LENGTH>0</LENGTH>
<INODEID>0</INODEID>
<PATH>......</PATH>
<REPLICATION>3</REPLICATION>
<MTIME>1581816138774</MTIME>
<ATIME>1581814760398</ATIME>
<BLOCKSIZE>536870912</BLOCKSIZE>
<CLIENT_NAME></CLIENT_NAME>
<CLIENT_MACHINE></CLIENT_MACHINE>
<OVERWRITE>false</OVERWRITE>
<BLOCK>
<BLOCK_ID>5568434562</BLOCK_ID>
<NUM_BYTES>185818644</NUM_BYTES>
<GENSTAMP>4495417845</GENSTAMP>
</BLOCK>
<PERMISSION_STATUS>
<USERNAME>da_music</USERNAME>
<GROUPNAME>hdfs</GROUPNAME>
<MODE>416</MODE>
</PERMISSION_STATUS>
</DATA>
</RECORD>
The block size should be 185818648 in the first CloseOp. When truncate is used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is synchronized to the JournalNode in the same batch. The block used by CloseOp twice is the same instance, which causes the first CloseOp has wrong block size. When SNN rolling Editlog, TruncateOp does not make the file to the UnderConstruction state. Then, when the second CloseOp is executed, the file is not in the UnderConstruction state, and SNN crashes.
Attachments
Attachments
Issue Links
- duplicates
-
HDFS-15391 Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
- Resolved
- is caused by
-
HDFS-7964 Add support for async edit logging
- Resolved