
HDFS-15175: Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.9.2
    • Fix Version/s: 3.4.0, 3.2.3, 3.3.2, 3.2.4
    • Component/s: None
    • Labels:
    • Hadoop Flags: Reviewed

      Description

       

      Crash exception

      2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
      java.io.IOException: File is not under construction: ......
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:360)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
      at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
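
      For context, here is a hedged, self-contained sketch of the kind of state check that fails during OP_CLOSE replay and produces the error above. The class, method, and path value are stand-ins for illustration only, not the actual FSEditLogLoader code.

      // Hedged sketch (toy classes, not FSEditLogLoader): applying OP_CLOSE requires
      // the target file to be under construction; otherwise the tailer fails with an
      // IOException like the one in the stack trace above.
      import java.io.IOException;

      public class CloseOpReplaySketch {

        // Minimal stand-in for an INodeFile's construction state on the standby.
        static final class FileState {
          final String path;
          boolean underConstruction;
          FileState(String path, boolean underConstruction) {
            this.path = path;
            this.underConstruction = underConstruction;
          }
        }

        static void applyCloseOp(FileState file) throws IOException {
          if (!file.underConstruction) {
            throw new IOException("File is not under construction: " + file.path);
          }
          file.underConstruction = false; // complete the file
        }

        public static void main(String[] args) throws IOException {
          FileState file = new FileState("/example/path", true); // hypothetical path

          applyCloseOp(file); // first CloseOp: succeeds, the file becomes complete

          // Per the description below, replaying TruncateOp does not put the file
          // back into the under-construction state.

          try {
            applyCloseOp(file); // second CloseOp: file is already complete
          } catch (IOException e) {
            System.out.println("SNN aborts tailing here: " + e.getMessage());
          }
        }
      }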

       

      Editlog

      <RECORD>
      <OPCODE>OP_REASSIGN_LEASE</OPCODE>
      <DATA>
      <TXID>32625021150</TXID>
      <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
      <PATH>......</PATH>
      <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
      </DATA>
      </RECORD>

      ......

      <RECORD>
      <OPCODE>OP_CLOSE</OPCODE>
      <DATA>
      <TXID>32625023743</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>0</INODEID>
      <PATH>......</PATH>
      <REPLICATION>3</REPLICATION>
      <MTIME>1581816135883</MTIME>
      <ATIME>1581814760398</ATIME>
      <BLOCKSIZE>536870912</BLOCKSIZE>
      <CLIENT_NAME></CLIENT_NAME>
      <CLIENT_MACHINE></CLIENT_MACHINE>
      <OVERWRITE>false</OVERWRITE>
      <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
      </BLOCK>
      <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
      </PERMISSION_STATUS>
      </DATA>
      </RECORD>

      ......

      <RECORD>
      <OPCODE>OP_TRUNCATE</OPCODE>
      <DATA>
      <TXID>32625024049</TXID>
      <SRC>......</SRC>
      <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
      <CLIENTMACHINE>......</CLIENTMACHINE>
      <NEWLENGTH>185818644</NEWLENGTH>
      <TIMESTAMP>1581816136336</TIMESTAMP>
      <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818648</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
      </BLOCK>
      </DATA>
      </RECORD>

      ......

      <RECORD>
      <OPCODE>OP_CLOSE</OPCODE>
      <DATA>
      <TXID>32625024993</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>0</INODEID>
      <PATH>......</PATH>
      <REPLICATION>3</REPLICATION>
      <MTIME>1581816138774</MTIME>
      <ATIME>1581814760398</ATIME>
      <BLOCKSIZE>536870912</BLOCKSIZE>
      <CLIENT_NAME></CLIENT_NAME>
      <CLIENT_MACHINE></CLIENT_MACHINE>
      <OVERWRITE>false</OVERWRITE>
      <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
      </BLOCK>
      <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
      </PERMISSION_STATUS>
      </DATA>
      </RECORD>

       

       

      The block size in the first CloseOp should be 185818648, but after the truncate it is recorded as 185818644. The CloseOp/TruncateOp/CloseOp sequence is synced to the JournalNode in the same batch, and both CloseOps reference the same Block instance, so the truncate mutates that shared block before the first CloseOp is flushed, leaving the first CloseOp with the wrong block size. When the edit log is rolled and the standby NameNode (SNN) replays these edits, the TruncateOp does not put the file back into the under-construction state, so when the second CloseOp is applied the file is not under construction and the SNN crashes.
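
      The shared-instance problem can be reproduced in isolation. The following is a minimal, self-contained sketch with stand-in classes (not Hadoop source), assuming the fix direction is to snapshot the block when the op is buffered rather than keeping a reference to the live namespace block; the committed fix should be consulted for the exact change.

      // Minimal sketch with hypothetical stand-in classes (not Hadoop source) showing
      // how a later truncate can corrupt an earlier, not-yet-flushed CloseOp when both
      // ops hold a reference to the same live Block instance.
      import java.util.ArrayList;
      import java.util.List;

      public class SharedBlockDemo {

        // Stand-in for a mutable block record (id, length, generation stamp).
        static final class Block {
          final long blockId;
          long numBytes;
          final long genStamp;
          Block(long blockId, long numBytes, long genStamp) {
            this.blockId = blockId;
            this.numBytes = numBytes;
            this.genStamp = genStamp;
          }
          Block copyOf() { return new Block(blockId, numBytes, genStamp); } // snapshot
          @Override public String toString() {
            return "blk_" + blockId + " numBytes=" + numBytes + " genStamp=" + genStamp;
          }
        }

        // Stand-in for a buffered edit-log op waiting to be synced to the JournalNode.
        static final class CloseOp {
          final long txid;
          final Block block;
          CloseOp(long txid, Block block) { this.txid = txid; this.block = block; }
        }

        public static void main(String[] args) {
          Block liveBlock = new Block(5568434562L, 185818648L, 4495417845L);
          List<CloseOp> pendingBatch = new ArrayList<>();

          // Buggy pattern: the buffered CloseOp references the live namespace block.
          pendingBatch.add(new CloseOp(32625023743L, liveBlock));

          // Truncate mutates the same instance before the batch is flushed ...
          liveBlock.numBytes = 185818644L;
          pendingBatch.add(new CloseOp(32625024993L, liveBlock));

          // ... so both buffered CloseOps now carry 185818644, even though the first
          // one was logged before the truncate and should still say 185818648.
          for (CloseOp op : pendingBatch) {
            System.out.println("txid=" + op.txid + " " + op.block);
          }

          // Fix direction: snapshot the block when the op is built, so later namespace
          // mutations cannot reach ops that are already buffered.
          CloseOp safeOp = new CloseOp(32625023743L, liveBlock.copyOf());
          liveBlock.numBytes = 1L; // later mutation no longer affects the buffered op
          System.out.println("with copy: txid=" + safeOp.txid + " " + safeOp.block);
        }
      }

      Copying the block at log time decouples buffered ops from subsequent namespace mutations, which is the general shape of the fix sketched above.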
