[HDFS-16583] DatanodeAdminDefaultMonitor can get stuck in an infinite loop - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0, 3.2.4, 3.3.5
Fix Version/s: 3.4.0, 3.2.4, 3.3.5
Component/s: datanode
Labels:
- pull-request-available

Target Version/s:

3.4.0, 3.2.4, 3.3.5
Hadoop Flags:

Reviewed

Description

We encountered a case where the decommission monitor in the namenode got stuck for about 6 hours. The logs give:

2022-05-15 01:09:25,490 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.132:50010
2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
<Thread dump>
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753665_3428271426
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753659_3428271420
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753662_3428271423
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753663_3428271424
2022-05-15 06:00:57,281 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.34:50010
2022-05-15 06:00:58,105 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 17492614 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
	Number of suppressed write-lock reports: 0
	Longest write-lock held interval: 17492614

We only have the one thread dump triggered by the FC:

Thread 80 (DatanodeAdminMonitor-0):
  State: RUNNABLE
  Blocked count: 16
  Waited count: 453693
  Stack:
    org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
    org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)

This was the line of code:

    private void check() {
      final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
          it = new CyclicIteration<>(outOfServiceNodeBlocks,
              iterkey).iterator();
      final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();

      while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
          .isRunning()) {
        numNodesChecked++;
        final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
            entry = it.next();
        final DatanodeDescriptor dn = entry.getKey();
        AbstractList<BlockInfo> blocks = entry.getValue();
        boolean fullScan = false;
        if (dn.isMaintenance() && dn.maintenanceExpired()) {
          // If maintenance expires, stop tracking it.
          stopMaintenance(dn);
          toRemove.add(dn);
          continue;
        }
        if (dn.isInMaintenance()) {
          // The dn is IN_MAINTENANCE and the maintenance hasn't expired yet.
          continue;  //// >>> This line
        }

With only one sample, we cannot figure out for sure if it is somehow stuck in an infinite loop, but I suspect it is.

The problem is two fold:

1) When we call stopMaintenance(dn), which we must have done as it logged the "Stopping maintenance of dead node", the code looks like:


       if (dn.isMaintenance() && dn.maintenanceExpired()) {
          // If maintenance expires, stop tracking it.
          stopMaintenance(dn);
          toRemove.add(dn);
          continue;
        }

Notice we add the DN to the toRemove list - this is to avoid concurrent modifications on the outOfServiceNodeBlocks map. However inside stopMaintenance we see:

      pendingNodes.remove(node);
      outOfServiceNodeBlocks.remove(node);

This should cause a concurrentModification exception, which I think will leave the decommission monitor broken until the NN is restarted normally (again I am not sure).

2) In some cases a ConcurrentModificationException is not thrown. This is because the iterator is a CyclicIterator. If the DN removed from maintenance is the one after the iterator start key, and its the second last element in the navigatable map, I believe the CyclicIterator will go into an infinite loop. No concurrent modification exception will get thrown, as the TreeMap iterator has no next(), so it creates a new TreeMap iterator, and then the "stop key" for the cyclic iterator is never found in the map as its now gone, and it just cycles forever. I think that is what happened here.

This test in TestCyclicIteration proves the infinite loop is possible:

  @Test
  public void testInfiniteLoop() {
    final NavigableMap<Integer, Integer> map = new TreeMap<Integer, Integer>();
    map.put(0, 0);
    map.put(2, 2);
    map.put(4, 4);
    map.put(6, 6);
    System.out.println("map=" + map);

    // Create iterator to start at 2, which means the first element returned
    // should be 4.
    // Inside the treemap iterator, next will already be set to 6
    // after 4 is returned.
    final Iterator<Map.Entry<Integer, Integer>>
        it =  new CyclicIteration<>(map, 2).iterator();
    int i = 0;
    while(it.hasNext()) {
      Map.Entry<Integer, Integer> e = it.next();
      if (i == 0) {
        map.remove(4);
      }
      System.out.println("Iteration #" + i++);
    }
  }

The root cause of the decommission monitor problem is the concurrentModification of the outOfServiceNodeBlocks - so we should avoid that.

Attachments

Issue Links

links to

GitHub Pull Request #4332

GitHub Pull Request #4394

Activity

People

Assignee:: Stephen O'Donnell

Reporter:: Stephen O'Donnell

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 19/May/22 10:41

Updated:: 11/Feb/24 23:47

Resolved:: 31/May/22 23:56

Time Tracking

Estimated:

Not Specified

Remaining:

Logged: