  Hadoop HDFS / HDFS-12749

DN may not send block report to NN after NN restart

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
    • Fix Version/s: 3.3.0, 2.8.6, 2.9.3, 3.1.4, 3.2.2, 2.10.1
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Our cluster has thousands of DNs and millions of files and blocks. When the NN restarts, its load is very high.
      After the NN restarts, each DN calls BPServiceActor#reRegister to register again. The register RPC can fail with an IOException because the NN is busy processing block reports. The exception is caught in BPServiceActor#processCommand.
      The caught IOException is:

      WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
      java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local host is: "DataNode_Host/Datanode_IP"; destination host is: "NameNode_Host":Port;
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
              at org.apache.hadoop.ipc.Client.call(Client.java:1474)
              at org.apache.hadoop.ipc.Client.call(Client.java:1407)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
              at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
              at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
              at java.lang.Thread.run(Thread.java:745)
      

      BPServiceActor#register does not catch this IOException, so the exception aborts the method before the block report is scheduled, and the block report cannot be sent immediately.

        /**
         * Register one bp with the corresponding NameNode
         * <p>
         * The bpDatanode needs to register with the namenode on startup in order
         * 1) to report which storage it is serving now and 
         * 2) to receive a registrationID
         *  
         * issued by the namenode to recognize registered datanodes.
         * 
         * @param nsInfo current NamespaceInfo
         * @see FSNamesystem#registerDatanode(DatanodeRegistration)
         * @throws IOException
         */
        void register(NamespaceInfo nsInfo) throws IOException {
          // The handshake() phase loaded the block pool storage
          // off disk - so update the bpRegistration object from that info
          DatanodeRegistration newBpRegistration = bpos.createRegistration();
      
          LOG.info(this + " beginning handshake with NN");
      
          while (shouldRun()) {
            try {
              // Use returned registration from namenode with updated fields
              newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
              newBpRegistration.setNamespaceInfo(nsInfo);
              bpRegistration = newBpRegistration;
              break;
            } catch(EOFException e) {  // namenode might have just restarted
              LOG.info("Problem connecting to server: " + nnAddr + " :"
                  + e.getLocalizedMessage());
              sleepAndLogInterrupts(1000, "connecting to server");
            } catch(SocketTimeoutException e) {  // namenode is busy
              LOG.info("Problem connecting to server: " + nnAddr);
              sleepAndLogInterrupts(1000, "connecting to server");
            }
          }
          
          LOG.info("Block pool " + this + " successfully registered with NN");
          bpos.registrationSucceeded(this, bpRegistration);
      
          // random short delay - helps scatter the BR from all DNs
          scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
        }
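
      The following is a minimal, self-contained sketch of why the two catch clauses above do not help here. It assumes, based on the message in the stack trace, that the SocketTimeoutException arrives wrapped inside plain IOExceptions. The class WrappedTimeoutDemo and its methods are hypothetical stand-ins that use only JDK classes; they are not part of Hadoop.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.SocketTimeoutException;

class WrappedTimeoutDemo {
    /** Hypothetical stand-in for the register RPC: the timeout is wrapped twice. */
    static void registerRpc() throws IOException {
        SocketTimeoutException timeout =
            new SocketTimeoutException("60000 millis timeout");
        throw new IOException("Failed on local exception", new IOException(timeout));
    }

    /** Uses the same catch clauses as register() and reports which path was taken. */
    static String classify() {
        try {
            try {
                registerRpc();
                return "registered";
            } catch (EOFException e) {
                return "retry (EOF)";
            } catch (SocketTimeoutException e) {
                return "retry (timeout)";
            }
        } catch (IOException e) {
            // The outer type is plain IOException, so neither inner clause matched.
            return "escaped to caller";
        }
    }

    /** Walks the cause chain to check whether a timeout is buried inside. */
    static boolean causedByTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SocketTimeoutException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(classify()); // prints "escaped to caller"
    }
}
```

      Java matches a catch clause against the type of the thrown object, not against its causes, so a wrapped timeout is only visible to code that inspects getCause(), as causedByTimeout does.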
      

      However, the NameNode has already processed registerDatanode successfully, so it will not ask the DN to re-register.
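
      One possible mitigation, sketched below, is to treat any IOException from the register RPC as retryable so that the step that schedules the block report is always reached once a call succeeds. This is an illustration only and is not necessarily what the attached patches do. RetryingRegisterSketch, RegisterRpc, failingFirst, and the bounded maxAttempts parameter are all hypothetical; the original loop runs while shouldRun() is true and sleeps 1000 ms between attempts, which the sketch omits.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class RetryingRegisterSketch {
    /** Hypothetical stand-in for bpNamenode.registerDatanode(...). */
    interface RegisterRpc {
        void call() throws IOException;
    }

    private boolean blockReportScheduled = false;

    /** Retries on any IOException; returns the attempt number that succeeded. */
    int register(RegisterRpc rpc, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                rpc.call();
                // Stands in for scheduler.scheduleBlockReport(...).
                blockReportScheduled = true;
                return attempt;
            } catch (IOException e) {
                last = e; // also covers timeouts wrapped in a plain IOException
            }
        }
        throw last;
    }

    boolean isBlockReportScheduled() {
        return blockReportScheduled;
    }

    /** Builds an RPC stub whose first `failures` calls throw a wrapped timeout. */
    static RegisterRpc failingFirst(int failures) {
        int[] remaining = {failures};
        return () -> {
            if (remaining[0]-- > 0) {
                throw new IOException("Failed on local exception",
                    new SocketTimeoutException("timeout"));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        RetryingRegisterSketch dn = new RetryingRegisterSketch();
        int attempts = dn.register(failingFirst(2), 5);
        // prints "attempts=3 scheduled=true"
        System.out.println("attempts=" + attempts
            + " scheduled=" + dn.isBlockReportScheduled());
    }
}
```

      Retrying on every IOException is a blunt choice: it could also retry failures that will never succeed, such as a registration the NameNode rejects outright, so a real change would likely need to distinguish those cases.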

      Attachments

        1. HDFS-12749.001.patch (6 kB, Yuxin Tan)
        2. HDFS-12749-branch-2.7.002.patch (1 kB, Xiaoqiao He)
        3. HDFS-12749-trunk.003.patch (1 kB, Xiaoqiao He)
        4. HDFS-12749-trunk.004.patch (2 kB, Xiaoqiao He)
        5. HDFS-12749-trunk.005.patch (5 kB, Xiaoqiao He)
        6. HDFS-12749-trunk.006.patch (2 kB, Xiaoqiao He)

          People

            Assignee: Xiaoqiao He (hexiaoqiao)
            Reporter: Yuxin Tan (tanyuxin)