Hadoop HDFS / HDFS-16697

Add logs if resources are not available in NameNodeResourcePolicy


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.3
    • Fix Version/s: 3.4.0
    • Component/s: namenode
    • Hadoop Flags: Reviewed

    Description

      <property>
        <name>dfs.namenode.resource.checked.volumes.minimum</name>
        <value>1</value>
        <description>
          The minimum number of redundant NameNode storage volumes required.
        </description>
      </property>

      I found that when “dfs.namenode.resource.checked.volumes.minimum” is set to a value greater than the total number of storage volumes in the NameNode, safe mode can never be turned off. In safe mode, the file system accepts only read requests and rejects delete, modify, and other change requests, which severely limits its functionality.

      The default value of this configuration item is 1. As an example, we set it to 2. After starting HDFS, the NameNode log and the client show the following messages.

      2022-07-27 17:37:31,772 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: NameNode low on available disk space. Already in safe mode.
      2022-07-27 17:37:31,772 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is ON.
      Resources are low on NN. Please add or free up more resourcesthen turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
      
      org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /hdfsapi/test. Name node is in safe mode.
      Resources are low on NN. Please add or free up more resourcesthen turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:192.168.1.167
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1468)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1455)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3174)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1145)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:714)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1000)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
              at java.base/java.security.AccessController.doPrivileged(Native Method)
              at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2916)

      The message suggests that there is not enough resource space to leave safe mode. However, after adding or freeing resources and lowering the resource threshold "dfs.namenode.resource.du.reserved", safe mode still cannot be turned off and the same message is shown.

      According to the source code, the NameNode enters safe mode if the number of redundant storage volumes with available space is less than the minimum set by "dfs.namenode.resource.checked.volumes.minimum". After debugging, we found that the current NameNode storage volumes have abundant free space. However, because the total number of NameNode storage volumes is less than the configured value, the number of volumes with sufficient free space must also be less than that value, so the NameNode always enters safe mode.
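
      The reasoning above can be illustrated with a minimal sketch. This is a simplified model, not the actual Hadoop source: the class VolumeCheckSketch, the method hasEnoughVolumes, and the sizes used are hypothetical. It assumes a volume counts as available when its free space is at least the reserved threshold.

```java
/** Simplified, hypothetical model of the volume check; not the Hadoop implementation. */
final class VolumeCheckSketch {

    /**
     * Returns true when at least minimumVolumes volumes have
     * freeBytes >= reservedBytes (assumed availability rule).
     */
    static boolean hasEnoughVolumes(long[] freeBytesPerVolume,
                                    long reservedBytes,
                                    int minimumVolumes) {
        int available = 0;
        for (long free : freeBytesPerVolume) {
            if (free >= reservedBytes) {
                available++;
            }
        }
        return available >= minimumVolumes;
    }

    public static void main(String[] args) {
        long reserved = 100L * 1024 * 1024;              // illustrative threshold: 100 MB
        long[] oneVolume = {500L * 1024 * 1024 * 1024};  // a single volume with 500 GB free

        // With a minimum of 1, the single healthy volume is enough.
        System.out.println(hasEnoughVolumes(oneVolume, reserved, 1)); // true

        // With a minimum of 2, the check fails although space is abundant.
        System.out.println(hasEnoughVolumes(oneVolume, reserved, 2)); // false

        // Lowering the threshold to 0 does not help: there is still only one volume.
        System.out.println(hasEnoughVolumes(oneVolume, 0L, 2));       // false
    }
}
```

      In this model, the count of available volumes can never exceed the total number of volumes, so freeing space or lowering the threshold cannot satisfy a minimum that is larger than the total.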

      In summary, the configuration item lacks a validity check and associated handling, which makes it hard to find the root cause when it is misconfigured.

      The solution I propose is to add a check on the value of this configuration item that prints a warning message in the log when the value is greater than the number of NameNode storage volumes, so that the problem can be found and fixed promptly and the misconfiguration does not affect subsequent operations.
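
      A rough sketch of such a check is shown below. It is only an illustration of the proposal and not the committed patch: the class MinimumVolumesWarningSketch, the method checkMinimum, and the wording of the warning are hypothetical, and java.util.logging stands in for whatever logger the NameNode uses.

```java
import java.util.logging.Logger;

/** Hypothetical illustration of the proposed configuration check. */
final class MinimumVolumesWarningSketch {

    private static final Logger LOG =
            Logger.getLogger(MinimumVolumesWarningSketch.class.getName());

    static final String KEY = "dfs.namenode.resource.checked.volumes.minimum";

    /**
     * Logs and returns a warning when the configured minimum exceeds the
     * total number of volumes; returns null when the value is satisfiable.
     */
    static String checkMinimum(int configuredMinimum, int totalVolumes) {
        if (configuredMinimum <= totalVolumes) {
            return null;
        }
        // Message text is an assumption made for this sketch.
        String message = String.format(
                "%s is set to %d, but only %d NameNode storage volume(s) exist; "
                        + "the resource check cannot pass until this is corrected.",
                KEY, configuredMinimum, totalVolumes);
        LOG.warning(message);
        return message;
    }

    public static void main(String[] args) {
        checkMinimum(2, 1); // logs a warning: minimum exceeds the volume count
        checkMinimum(1, 1); // logs nothing: the minimum can be satisfied
    }
}
```

      Returning the message as well as logging it is a choice made only to keep the sketch easy to test; a real change could simply log the warning.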

            People

              fujx ECFuzz
              fujx ECFuzz

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 50m