Hadoop HDFS / HDFS-12749

DN may not send block report to NN after NN restart

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
    • Fix Version/s: 3.3.0, 2.8.6, 2.9.3, 3.1.4, 3.2.2, 2.10.1
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Our cluster has thousands of DNs and millions of files and blocks. When the NN restarts, its load is very high.
      After the NN restarts, each DN calls BPServiceActor#reRegister to register. However, the register RPC can fail with an IOException because the NN is busy processing block reports. The exception is caught in BPServiceActor#processCommand.
      The caught IOException is:

      WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
      java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local host is: "DataNode_Host/Datanode_IP"; destination host is: "NameNode_Host":Port;
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
              at org.apache.hadoop.ipc.Client.call(Client.java:1474)
              at org.apache.hadoop.ipc.Client.call(Client.java:1407)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
              at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
              at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
              at java.lang.Thread.run(Thread.java:745)
      

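      This IOException is not caught inside BPServiceActor#register (it is a plain IOException wrapping the SocketTimeoutException, so neither catch clause matches). It aborts the method before scheduler.scheduleBlockReport is reached, so the block report cannot be sent immediately.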

        /**
         * Register one bp with the corresponding NameNode
         * <p>
         * The bpDatanode needs to register with the namenode on startup in order
         * 1) to report which storage it is serving now and 
         * 2) to receive a registrationID
         *  
         * issued by the namenode to recognize registered datanodes.
         * 
         * @param nsInfo current NamespaceInfo
         * @see FSNamesystem#registerDatanode(DatanodeRegistration)
         * @throws IOException
         */
        void register(NamespaceInfo nsInfo) throws IOException {
          // The handshake() phase loaded the block pool storage
          // off disk - so update the bpRegistration object from that info
          DatanodeRegistration newBpRegistration = bpos.createRegistration();
      
          LOG.info(this + " beginning handshake with NN");
      
          while (shouldRun()) {
            try {
              // Use returned registration from namenode with updated fields
              newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
              newBpRegistration.setNamespaceInfo(nsInfo);
              bpRegistration = newBpRegistration;
              break;
            } catch(EOFException e) {  // namenode might have just restarted
              LOG.info("Problem connecting to server: " + nnAddr + " :"
                  + e.getLocalizedMessage());
              sleepAndLogInterrupts(1000, "connecting to server");
            } catch(SocketTimeoutException e) {  // namenode is busy
              LOG.info("Problem connecting to server: " + nnAddr);
              sleepAndLogInterrupts(1000, "connecting to server");
            }
          }
          
          LOG.info("Block pool " + this + " successfully registered with NN");
          bpos.registrationSucceeded(this, bpRegistration);
      
          // random short delay - helps scatter the BR from all DNs
          scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
        }
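
The reason the two catch clauses above miss this failure can be shown with a small standalone sketch. It assumes, based on the log line "java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException", that the exception reaching register() is a plain IOException with the timeout as its cause. The class WrappedTimeoutDemo and the method fakeRegisterRpc are hypothetical stand-ins, not Hadoop code.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.SocketTimeoutException;

class WrappedTimeoutDemo {
    // Hypothetical stand-in for the register RPC: throws a plain IOException
    // whose cause is a SocketTimeoutException (simplified from the log above).
    static void fakeRegisterRpc() throws IOException {
        throw new IOException("Failed on local exception",
                new SocketTimeoutException("60000 millis timeout"));
    }

    // Reports which handler, if any, saw the exception.
    static String classify() {
        try {
            try {
                fakeRegisterRpc();
                return "ok";
            } catch (EOFException e) {
                // Same clause types as in register(): not matched here.
                return "caught EOFException";
            } catch (SocketTimeoutException e) {
                // Not matched either: the timeout is only the cause,
                // not the type of the thrown exception.
                return "caught SocketTimeoutException";
            }
        } catch (IOException e) {
            // Plays the role of the caller (processCommand) catching it.
            return "escaped as " + e.getClass().getSimpleName()
                    + " caused by " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // prints "escaped as IOException caused by SocketTimeoutException"
        System.out.println(classify());
    }
}
```

Java matches catch clauses against the type of the thrown exception, not its cause chain, so the retry-and-sleep branches are never entered in this situation.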
      

      However, the NameNode has already processed registerDatanode successfully, so it will not ask the DN to re-register.
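
One possible direction, sketched here under assumptions and not necessarily what the attached patches do, is to retry the register call on any IOException so that the code after the loop (including block report scheduling) is still reached. RegisterRetrySketch, RegisterCall, and registerWithRetry are hypothetical names for illustration. The real loop runs while shouldRun() is true, whereas this sketch bounds the attempts so it terminates as a demo.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class RegisterRetrySketch {
    /** Hypothetical stand-in for bpNamenode.registerDatanode(...). */
    interface RegisterCall {
        void call() throws IOException;
    }

    /**
     * Calls rpc until it succeeds or maxAttempts is exhausted.
     * Returns the number of attempts used; rethrows the last failure otherwise.
     */
    static int registerWithRetry(RegisterCall rpc, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                rpc.call();
                return attempt; // success: caller can go on to schedule the block report
            } catch (IOException e) {
                // Catches wrapped timeouts as well as EOFException and
                // SocketTimeoutException, since both are IOException subclasses.
                last = e;
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated busy NameNode: the first two calls time out, the third succeeds.
        RegisterCall flaky = () -> {
            if (++calls[0] <= 2) {
                throw new IOException("Failed on local exception",
                        new SocketTimeoutException("timeout"));
            }
        };
        int attempts = registerWithRetry(flaky, 5, 10);
        System.out.println("registered after " + attempts + " attempts");
    }
}
```

Catching every IOException is deliberately broad in this sketch. A real change would likely need to separate transient connection problems from errors that should not be retried.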

        Attachments

        1. HDFS-12749.001.patch (6 kB, TanYuxin)
        2. HDFS-12749-branch-2.7.002.patch (1 kB, Xiaoqiao He)
        3. HDFS-12749-trunk.003.patch (1 kB, Xiaoqiao He)
        4. HDFS-12749-trunk.004.patch (2 kB, Xiaoqiao He)
        5. HDFS-12749-trunk.005.patch (5 kB, Xiaoqiao He)
        6. HDFS-12749-trunk.006.patch (2 kB, Xiaoqiao He)

            People

            • Assignee: Xiaoqiao He (hexiaoqiao)
            • Reporter: TanYuxin (tanyuxin)
            • Votes: 0
            • Watchers: 17
