HDFS-16583: DatanodeAdminDefaultMonitor can get stuck in an infinite loop



    Description

      We encountered a case where the decommission monitor in the namenode got stuck for about 6 hours. The logs show:

      2022-05-15 01:09:25,490 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.132:50010
      2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
      <Thread dump>
      2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753665_3428271426
      2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753659_3428271420
      2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753662_3428271423
      2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753663_3428271424
      2022-05-15 06:00:57,281 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.34:50010
      2022-05-15 06:00:58,105 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 17492614 ms via
      java.lang.Thread.getStackTrace(Thread.java:1559)
      org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
      org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
      org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
      org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
      org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      java.lang.Thread.run(Thread.java:748)
      	Number of suppressed write-lock reports: 0
      	Longest write-lock held interval: 17492614
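
      For scale, 17492614 ms is roughly 4.9 hours of continuously holding the FSNamesystem write lock, which matches the gap between the 01:09 and 06:00 log lines above.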
      

      We only have the one thread dump, triggered by the failover controller (FC):

      Thread 80 (DatanodeAdminMonitor-0):
        State: RUNNABLE
        Blocked count: 16
        Waited count: 453693
        Stack:
          org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
          org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
          java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
          java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
          java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
          java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          java.lang.Thread.run(Thread.java:748)
      

      The stack trace points at this code in DatanodeAdminManager#check(); the marked line below is DatanodeAdminManager.java:538 from the dump above:

          private void check() {
            final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
                it = new CyclicIteration<>(outOfServiceNodeBlocks,
                    iterkey).iterator();
            final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
      
            while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
                .isRunning()) {
              numNodesChecked++;
              final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
                  entry = it.next();
              final DatanodeDescriptor dn = entry.getKey();
              AbstractList<BlockInfo> blocks = entry.getValue();
              boolean fullScan = false;
              if (dn.isMaintenance() && dn.maintenanceExpired()) {
                // If maintenance expires, stop tracking it.
                stopMaintenance(dn);
                toRemove.add(dn);
                continue;
              }
              if (dn.isInMaintenance()) {
                // The dn is IN_MAINTENANCE and the maintenance hasn't expired yet.
                continue;  //// >>> This line
              }
      

      With only one sample we cannot tell for sure whether it is stuck in an infinite loop, but I suspect it is.

      The problem is twofold:

      1) We must have called stopMaintenance(dn), since "Stopping maintenance of dead node" was logged. The code around that call looks like:

      
             if (dn.isMaintenance() && dn.maintenanceExpired()) {
                // If maintenance expires, stop tracking it.
                stopMaintenance(dn);
                toRemove.add(dn);
                continue;
              }
      

      Notice that we add the DN to the toRemove list; this is to avoid concurrent modification of the outOfServiceNodeBlocks map while we are iterating over it. However, inside stopMaintenance we see:

            pendingNodes.remove(node);
            outOfServiceNodeBlocks.remove(node);
      

      This removes the node from outOfServiceNodeBlocks while the iterator over that map is still live. Normally that should cause a ConcurrentModificationException, which I think would leave the decommission monitor broken until the NN is restarted (again, I am not sure).
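
      As a standalone illustration of the fail-fast behaviour involved (this is not HDFS code), removing a key from a TreeMap behind a live entry-set iterator throws on the next call to next():

        import java.util.Iterator;
        import java.util.Map;
        import java.util.TreeMap;

        public class FailFastDemo {
          public static void main(String[] args) {
            final TreeMap<Integer, Integer> map = new TreeMap<>();
            map.put(1, 1);
            map.put(2, 2);
            map.put(3, 3);

            final Iterator<Map.Entry<Integer, Integer>> it =
                map.entrySet().iterator();
            it.next();      // returns 1=1
            map.remove(2);  // structural modification, not via the iterator
            it.next();      // throws ConcurrentModificationException
          }
        }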

      2) In some cases a ConcurrentModificationException is not thrown, because the iterator is a CyclicIteration. If the DN removed from maintenance is the one just after the iterator start key, and it is the second-to-last element in the navigable map, I believe the CyclicIteration will go into an infinite loop. No ConcurrentModificationException is thrown: the underlying TreeMap iterator has nothing left to return, so CyclicIteration creates a brand new TreeMap iterator over the modified map, and the "stop entry" for the cyclic iteration is never found again because it has been removed, so it just cycles forever. I think that is what happened here.
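
      For context, the wrap-around logic in org.apache.hadoop.util.CyclicIteration looks roughly like this (a simplified paraphrase, not the verbatim source):

        // Inside CyclicIteration's iterator (simplified paraphrase).
        private Map.Entry<K, V> nextEntry() {
          if (!i.hasNext()) {
            // The current iterator is exhausted: wrap around with a brand
            // new iterator over the whole map. A fresh iterator sees the
            // current modCount, so it can never detect the earlier removal.
            i = navigablemap.entrySet().iterator();
          }
          return i.next();
        }

        @Override
        public Map.Entry<K, V> next() {
          final Map.Entry<K, V> curr = next;
          next = nextEntry();
          // Termination check: stop once we come back around to the first
          // entry returned. If that entry was removed from the map, nothing
          // will ever equal it and hasNext() stays true forever.
          hasnext = !next.equals(first);
          return curr;
        }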

      The following test, added to TestCyclicIteration, demonstrates that the infinite loop is possible:

        import java.util.Iterator;
        import java.util.Map;
        import java.util.NavigableMap;
        import java.util.TreeMap;

        import org.junit.Test;

        @Test
        public void testInfiniteLoop() {
          final NavigableMap<Integer, Integer> map = new TreeMap<>();
          map.put(0, 0);
          map.put(2, 2);
          map.put(4, 4);
          map.put(6, 6);
          System.out.println("map=" + map);

          // Create an iterator starting at 2, so the first element returned
          // should be 4. Inside the TreeMap iterator, next will already be
          // set to 6 after 4 is returned.
          final Iterator<Map.Entry<Integer, Integer>> it =
              new CyclicIteration<>(map, 2).iterator();
          int i = 0;
          while (it.hasNext()) {
            it.next();
            if (i == 0) {
              // Remove the first entry the CyclicIteration returned. Its
              // "stop entry" is now gone, so the loop never terminates.
              map.remove(4);
            }
            System.out.println("Iteration #" + i++);
          }
        }
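
      Assuming the CyclicIteration logic sketched above, the trace is: 4 is returned and then removed; 6 is returned; the tailMap iterator is exhausted, so a fresh iterator over {0=0, 2=2, 6=6} is created; and since the first entry (4=4) no longer exists, next.equals(first) can never be true and hasNext() never returns false.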
      

      The root cause of the decommission monitor problem is the concurrent modification of outOfServiceNodeBlocks, so we should avoid that.
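
      One possible shape for a fix (a sketch only, not necessarily the committed patch) is to make stopMaintenance stop touching outOfServiceNodeBlocks while the monitor is iterating, and let the monitor drain the toRemove list once the iteration is done:

        // Sketch: defer all map removals until the iterator is no longer live.
        final List<DatanodeDescriptor> toRemove = new LinkedList<>();
        while (it.hasNext() && !exceededNumBlocksPerCheck()
            && namesystem.isRunning()) {
          final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>> entry =
              it.next();
          final DatanodeDescriptor dn = entry.getKey();
          if (dn.isMaintenance() && dn.maintenanceExpired()) {
            stopMaintenance(dn); // must no longer remove from outOfServiceNodeBlocks
            toRemove.add(dn);
            continue;
          }
          // ... rest of the per-node checks ...
        }
        // Safe: the CyclicIteration has finished, so removing entries here
        // cannot invalidate a live TreeMap iterator.
        for (DatanodeDescriptor dn : toRemove) {
          outOfServiceNodeBlocks.remove(dn);
        }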



          People

            Assignee: Stephen O'Donnell
            Reporter: Stephen O'Donnell
