  1. Hadoop HDFS
  2. HDFS-14579

In refreshNodes, avoid performing a DNS lookup while holding the write lock

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      When refreshNodes is called on a large cluster, or on a cluster where DNS is performing poorly, it can cause the namenode to hang for a long time. This is because the refreshNodes operation holds the global write lock while it runs. Most of the refreshNodes code is simple and therefore fast, but it performs a DNS lookup for each host in the cluster while the lock is held.

      Right now, it calls:

        public void refreshNodes(final Configuration conf) throws IOException {
          refreshHostsReader(conf);
          namesystem.writeLock();
          try {
            refreshDatanodes();
            countSoftwareVersions();
          } finally {
            namesystem.writeUnlock();
          }
        }
      

      The call to refreshHostsReader(conf) reads the new config file and does a DNS lookup on each entry; the write lock is not held at that point. The main work is then done in refreshDatanodes(), which runs under the write lock:

        private void refreshDatanodes() {
          final Map<String, DatanodeDescriptor> copy;
          synchronized (this) {
            copy = new HashMap<>(datanodeMap);
          }
          for (DatanodeDescriptor node : copy.values()) {
            // Check if not include.
            if (!hostConfigManager.isIncluded(node)) {
              node.setDisallowed(true);
            } else {
              long maintenanceExpireTimeInMS =
                  hostConfigManager.getMaintenanceExpirationTimeInMS(node);
              if (node.maintenanceNotExpired(maintenanceExpireTimeInMS)) {
                datanodeAdminManager.startMaintenance(
                    node, maintenanceExpireTimeInMS);
              } else if (hostConfigManager.isExcluded(node)) {
                datanodeAdminManager.startDecommission(node);
              } else {
                datanodeAdminManager.stopMaintenance(node);
                datanodeAdminManager.stopDecommission(node);
              }
            }
            node.setUpgradeDomain(hostConfigManager.getUpgradeDomain(node));
          }
        }
      

      Methods such as isIncluded() and isExcluded() all call node.getResolvedAddress(), which does the DNS lookup. We could probably change things to perform all the DNS lookups outside of the write lock, and then take the lock and process the nodes. This would also mean changing or overloading isIncluded() etc. to take the inetAddress rather than the datanode descriptor.

      This would not shorten the overall run time of the operation, but it would move the slow DNS lookups out of the write lock, so the namenode is not blocked for the entire duration.
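
      The idea above can be sketched as two phases: resolve every host first with no lock held, then take the write lock and apply the results using an overload that accepts an already-resolved address. This is only a minimal illustration under stated assumptions, not the attached patch. RefreshNodesSketch, its fields, and the injected resolver function are hypothetical stand-ins for the namesystem lock, HostConfigManager, and node.getResolvedAddress().

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical stand-in for the refresh logic; these are not the real Hadoop classes. */
class RefreshNodesSketch {
  // Stands in for the namesystem's global lock (assumption).
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // Stands in for the include list held by the host config manager (assumption).
  private final Set<InetAddress> includes = new HashSet<>();
  private final Set<String> disallowed = new HashSet<>();
  // Set to true if a lookup ever runs while this thread holds the write lock.
  private boolean resolvedUnderLock = false;

  RefreshNodesSketch(Set<InetAddress> includes) {
    this.includes.addAll(includes);
  }

  /** Overload that takes a pre-resolved address, so no DNS lookup happens here. */
  boolean isIncluded(InetAddress addr) {
    return addr != null && includes.contains(addr);
  }

  void refreshNodes(List<String> hosts, Function<String, InetAddress> resolver) {
    // Phase 1: potentially slow lookups, performed with no lock held.
    Map<String, InetAddress> resolved = new HashMap<>();
    for (String host : hosts) {
      if (lock.isWriteLockedByCurrentThread()) {
        resolvedUnderLock = true;
      }
      resolved.put(host, resolver.apply(host)); // null means "could not resolve"
    }

    // Phase 2: fast in-memory work only, performed under the write lock.
    lock.writeLock().lock();
    try {
      disallowed.clear();
      for (Map.Entry<String, InetAddress> e : resolved.entrySet()) {
        if (!isIncluded(e.getValue())) {
          disallowed.add(e.getKey());
        }
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  Set<String> getDisallowed() {
    return new HashSet<>(disallowed);
  }

  boolean resolvedUnderLock() {
    return resolvedUnderLock;
  }

  public static void main(String[] args) throws UnknownHostException {
    // getByAddress builds an InetAddress from raw bytes without a DNS query.
    InetAddress dn1 = InetAddress.getByAddress("dn1", new byte[] {10, 0, 0, 1});
    InetAddress dn2 = InetAddress.getByAddress("dn2", new byte[] {10, 0, 0, 2});
    Map<String, InetAddress> fakeDns = new HashMap<>();
    fakeDns.put("dn1", dn1);
    fakeDns.put("dn2", dn2);

    RefreshNodesSketch sketch = new RefreshNodesSketch(Collections.singleton(dn1));
    sketch.refreshNodes(Arrays.asList("dn1", "dn2", "dn3"), fakeDns::get);
    System.out.println("Disallowed hosts: " + sketch.getDisallowed());
    System.out.println("Lookup ran under lock: " + sketch.resolvedUnderLock());
  }
}
```

      One caveat with this approach is that the set of nodes can change between the two phases, so a real implementation would need to handle nodes that register after the addresses were resolved.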

        Attachments

        1. HDFS-14579.001.patch
          10 kB
          Stephen O'Donnell

          Activity

            People

            • Assignee:
              sodonnell Stephen O'Donnell
              Reporter:
              sodonnell Stephen O'Donnell
            • Votes:
              0
              Watchers:
              7
