Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-15175

Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

    XMLWordPrintableJSON

Details

    Description

       

      Crash exception

      2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
      java.io.IOException: File is not under construction: ......
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:360)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
      at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)

       

      Editlog

      <RECORD>
      <OPCODE>OP_REASSIGN_LEASE</OPCODE>
      <DATA>
      <TXID>32625021150</TXID>
      <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
      <PATH>......</PATH>
      <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
      </DATA>
      </RECORD>

      ......

      <RECORD>
      <OPCODE>OP_CLOSE</OPCODE>
      <DATA>
      <TXID>32625023743</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>0</INODEID>
      <PATH>......</PATH>
      <REPLICATION>3</REPLICATION>
      <MTIME>1581816135883</MTIME>
      <ATIME>1581814760398</ATIME>
      <BLOCKSIZE>536870912</BLOCKSIZE>
      <CLIENT_NAME></CLIENT_NAME>
      <CLIENT_MACHINE></CLIENT_MACHINE>
      <OVERWRITE>false</OVERWRITE>
      <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
      </BLOCK>
      <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
      </PERMISSION_STATUS>
      </DATA>
      </RECORD>

      ......

      <RECORD>
      <OPCODE>OP_TRUNCATE</OPCODE>
      <DATA>
      <TXID>32625024049</TXID>
      <SRC>......</SRC>
      <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
      <CLIENTMACHINE>......</CLIENTMACHINE>
      <NEWLENGTH>185818644</NEWLENGTH>
      <TIMESTAMP>1581816136336</TIMESTAMP>
      <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818648</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
      </BLOCK>
      </DATA>
      </RECORD>

      ......

      <RECORD>
      <OPCODE>OP_CLOSE</OPCODE>
      <DATA>
      <TXID>32625024993</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>0</INODEID>
      <PATH>......</PATH>
      <REPLICATION>3</REPLICATION>
      <MTIME>1581816138774</MTIME>
      <ATIME>1581814760398</ATIME>
      <BLOCKSIZE>536870912</BLOCKSIZE>
      <CLIENT_NAME></CLIENT_NAME>
      <CLIENT_MACHINE></CLIENT_MACHINE>
      <OVERWRITE>false</OVERWRITE>
      <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
      </BLOCK>
      <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
      </PERMISSION_STATUS>
      </DATA>
      </RECORD>

       

       

      The block size should be 185818648 in the first CloseOp. When truncate is used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is synchronized to the JournalNode in the same batch. The block used by CloseOp twice is the same instance, which causes the first CloseOp has wrong block size. When SNN rolling Editlog, TruncateOp does not make the file to the UnderConstruction state. Then, when the second CloseOp is executed, the file is not in the UnderConstruction state, and SNN crashes.

      Attachments

        1. HDFS-15175-trunk.1.patch
          2 kB
          Wan Chang

        Issue Links

          Activity

            People

              wanchang Wan Chang
              caiyicong Yicong Cai
              Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: