Details
Description
We encountered a case where the decommission monitor in the namenode got stuck for about 6 hours. The logs give:
2022-05-15 01:09:25,490 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.132:50010 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested <Thread dump> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753665_3428271426 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753659_3428271420 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753662_3428271423 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753663_3428271424 2022-05-15 06:00:57,281 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.34:50010 2022-05-15 06:00:58,105 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 17492614 ms via java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748) Number of suppressed write-lock reports: 0 Longest write-lock held interval: 17492614
We only have the one thread dump triggered by the FC:
Thread 80 (DatanodeAdminMonitor-0): State: RUNNABLE Blocked count: 16 Waited count: 453693 Stack: org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748)
This was the line of code:
private void check() { final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>> it = new CyclicIteration<>(outOfServiceNodeBlocks, iterkey).iterator(); final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>(); while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem .isRunning()) { numNodesChecked++; final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>> entry = it.next(); final DatanodeDescriptor dn = entry.getKey(); AbstractList<BlockInfo> blocks = entry.getValue(); boolean fullScan = false; if (dn.isMaintenance() && dn.maintenanceExpired()) { // If maintenance expires, stop tracking it. stopMaintenance(dn); toRemove.add(dn); continue; } if (dn.isInMaintenance()) { // The dn is IN_MAINTENANCE and the maintenance hasn't expired yet. continue; //// >>> This line }
With only one sample, we cannot figure out for sure if it is somehow stuck in an infinite loop, but I suspect it is.
The problem is two fold:
1) When we call stopMaintenance(dn), which we must have done as it logged the "Stopping maintenance of dead node", the code looks like:
if (dn.isMaintenance() && dn.maintenanceExpired()) { // If maintenance expires, stop tracking it. stopMaintenance(dn); toRemove.add(dn); continue; }
Notice we add the DN to the toRemove list - this is to avoid concurrent modifications on the outOfServiceNodeBlocks map. However inside stopMaintenance we see:
pendingNodes.remove(node); outOfServiceNodeBlocks.remove(node);
This should cause a concurrentModification exception, which I think will leave the decommission monitor broken until the NN is restarted normally (again I am not sure).
2) In some cases a ConcurrentModificationException is not thrown. This is because the iterator is a CyclicIterator. If the DN removed from maintenance is the one after the iterator start key, and its the second last element in the navigatable map, I believe the CyclicIterator will go into an infinite loop. No concurrent modification exception will get thrown, as the TreeMap iterator has no next(), so it creates a new TreeMap iterator, and then the "stop key" for the cyclic iterator is never found in the map as its now gone, and it just cycles forever. I think that is what happened here.
This test in TestCyclicIteration proves the infinite loop is possible:
@Test public void testInfiniteLoop() { final NavigableMap<Integer, Integer> map = new TreeMap<Integer, Integer>(); map.put(0, 0); map.put(2, 2); map.put(4, 4); map.put(6, 6); System.out.println("map=" + map); // Create iterator to start at 2, which means the first element returned // should be 4. // Inside the treemap iterator, next will already be set to 6 // after 4 is returned. final Iterator<Map.Entry<Integer, Integer>> it = new CyclicIteration<>(map, 2).iterator(); int i = 0; while(it.hasNext()) { Map.Entry<Integer, Integer> e = it.next(); if (i == 0) { map.remove(4); } System.out.println("Iteration #" + i++); } }
The root cause of the decommission monitor problem is the concurrentModification of the outOfServiceNodeBlocks - so we should avoid that.