  1. Hadoop HDFS
  2. HDFS-15556

Fix NPE in DatanodeDescriptor#updateStorageStats when handling DN Lifeline


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

    Description

      In our cluster, the NameNode throws a NullPointerException (NPE) while processing lifeline messages sent by a DataNode, which causes the NameNode to calculate an incorrect maxLoad.
      Because DataNodes are then identified as busy, no available node can be allocated when choosing a DataNode. The selection loop keeps executing, which results in high CPU usage and reduces the processing performance of the cluster.

      NameNode exception stack:

      2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from xxxxx:34766
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
              at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
              at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
              at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
              at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
              at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
      
      // DatanodeDescriptor#updateStorageStats
      ...
      for (StorageReport report : reports) {
      
            DatanodeStorageInfo storage = null;
            synchronized (storageMap) {
              storage =
                  storageMap.get(report.getStorage().getStorageID());
            }
            if (checkFailedStorages) {
              failedStorageInfos.remove(storage);
            }
      
            storage.receivedHeartbeat(report);  // NPE occurs here when storage is null
            // skip accounting for capacity of PROVIDED storages!
            if (StorageType.PROVIDED.equals(storage.getStorageType())) {
              continue;
            }
      ...
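
The excerpt above dereferences the result of `storageMap.get(...)` without checking it, so a report for a storage ID that is missing from the map triggers the NPE. A minimal sketch of one possible guard follows. It is not the content of HDFS-15556.001.patch; `Report`, `StorageInfo`, `addStorage`, and `skipped` are hypothetical stand-ins for the real Hadoop types, and skipping the unknown report is an assumption about the desired behavior.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LifelineNullGuardSketch {
  /** Hypothetical stand-in for StorageReport: a storage ID plus used bytes. */
  static final class Report {
    final String storageId;
    final long dfsUsed;

    Report(String storageId, long dfsUsed) {
      this.storageId = storageId;
      this.dfsUsed = dfsUsed;
    }
  }

  /** Hypothetical stand-in for DatanodeStorageInfo. */
  static final class StorageInfo {
    long dfsUsed;

    void receivedHeartbeat(Report report) {
      this.dfsUsed = report.dfsUsed;
    }
  }

  private final Map<String, StorageInfo> storageMap = new HashMap<>();

  /** Storage IDs that were reported but not found in storageMap. */
  final List<String> skipped = new ArrayList<>();

  void addStorage(String storageId) {
    synchronized (storageMap) {
      storageMap.put(storageId, new StorageInfo());
    }
  }

  /** Applies each report and returns the total used bytes of known storages. */
  long updateStorageStats(List<Report> reports) {
    long totalDfsUsed = 0;
    for (Report report : reports) {
      StorageInfo storage;
      synchronized (storageMap) {
        storage = storageMap.get(report.storageId);
      }
      if (storage == null) {
        // Assumption: an unknown storage is recorded and skipped
        // instead of being dereferenced.
        skipped.add(report.storageId);
        continue;
      }
      storage.receivedHeartbeat(report);
      totalDfsUsed += report.dfsUsed;
    }
    return totalDfsUsed;
  }

  public static void main(String[] args) {
    LifelineNullGuardSketch node = new LifelineNullGuardSketch();
    node.addStorage("DS-known");

    List<Report> reports = new ArrayList<>();
    reports.add(new Report("DS-known", 100L));
    reports.add(new Report("DS-unknown", 50L));

    long total = node.updateStorageStats(reports);
    System.out.println("total=" + total + " skipped=" + node.skipped);
  }
}
```

In this sketch the report for `DS-unknown` is skipped and the remaining report is still applied, so one bad report does not abort the whole update. Whether the real fix should skip, log, or register the missing storage is a design decision for the actual patch.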
      

      Attachments

        1. HDFS-15556.001.patch (1 kB, Haiyang Hu)
        2. NN_DN.LOG (6 kB, Haiyang Hu)
        3. NN-CPU.png (823 kB, Haiyang Hu)

            People

              Assignee: Unassigned
              Reporter: Haiyang Hu
              Votes: 0
              Watchers: 5
