Hadoop Common / HADOOP-4679

Datanode prints tons of log messages: Waiting for threadgroup to exit, active threads is XX

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      1. Only the datanode's offerService thread shuts down the datanode, to avoid deadlock.
      2. The datanode checks the disk when creating a block file fails.

      Description

      When a data receiver thread sees a disk error, it immediately calls shutdown() to shut down the DataNode. But the shutdown method does not return until all data receiver threads have exited, which never happens because the calling thread is itself a data receiver. The DataNode therefore gets stuck in a deadlock/livelock state, emitting tons of log messages: Waiting for threadgroup to exit, active threads is XX.
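
The safe pattern can be sketched as follows: a receiver thread only raises a flag, and the service thread, which is not part of the receiver group, does the waiting. This is a minimal illustration, not the DataNode code. The class name, `reportDiskError`, and the use of `ThreadGroup.activeCount()` are assumptions made for the example; only the `shouldRun` flag and the idea of waiting for a thread group come from this issue.

```java
public class ShutdownOwnerSketch {
    // Stand-in for DataNode.shouldRun; volatile so the service thread sees the update.
    private volatile boolean shouldRun = true;
    // Hypothetical thread group holding the data receiver threads.
    private final ThreadGroup receivers = new ThreadGroup("dataReceivers");

    /** Called by a receiver thread on disk error: signal only, never wait. */
    void reportDiskError() {
        shouldRun = false;
    }

    /** Waits for all receivers to exit; must not be called from a receiver thread. */
    private void shutdown() throws InterruptedException {
        shouldRun = false;
        while (receivers.activeCount() > 0) {
            Thread.sleep(10);
        }
    }

    /** Runs a toy service loop on the calling thread; returns true after a clean shutdown. */
    public boolean runService() {
        // The receiver simulates a disk error as soon as it starts.
        Thread receiver = new Thread(receivers, this::reportDiskError, "receiver-1");
        receiver.start();
        try {
            while (shouldRun) {
                Thread.sleep(10);
            }
            shutdown();
            return receivers.activeCount() == 0;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If the receiver called `shutdown()` itself, the wait loop would count the caller as an active thread and never finish, which is the hang described above.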

      1. diskError3-br18.patch
        7 kB
        Hairong Kuang
      2. diskError3.patch
        9 kB
        Hairong Kuang
      3. diskError2.patch
        8 kB
        Hairong Kuang
      4. diskError1.patch
        6 kB
        Hairong Kuang
      5. diskError.patch
        6 kB
        Hairong Kuang

          Activity

          Hairong Kuang added a comment -

          This patch sets DataNode.shouldRun to false when a disk error is detected while receiving a block. It also sets a timeout of 10s on DataXceiverServer's server socket so that the DataXceiverServer can wake up periodically to check whether it should continue to run.
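
A minimal sketch of an accept loop with a server socket timeout, assuming a `shouldRun`-style flag. The class and method names are hypothetical and this is not the DataXceiverServer code; it only shows how `ServerSocket.setSoTimeout` lets a blocked `accept()` return periodically so the flag can be re-checked. Treating the timeout as a normal, silent event is a choice made for this sketch.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class AcceptLoopSketch implements Runnable {
    private final ServerSocket serverSocket;
    private volatile boolean shouldRun = true;

    public AcceptLoopSketch(int timeoutMillis) throws IOException {
        // Port 0 asks the OS for any free port; loopback keeps the demo local.
        serverSocket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        // accept() now throws SocketTimeoutException after timeoutMillis of idleness.
        serverSocket.setSoTimeout(timeoutMillis);
    }

    public void stop() {
        shouldRun = false;
    }

    @Override
    public void run() {
        try {
            while (shouldRun) {
                try (Socket peer = serverSocket.accept()) {
                    // A real server would hand the connection to a worker here.
                } catch (SocketTimeoutException idle) {
                    // Expected when no client connects; loop to re-check shouldRun.
                }
            }
        } catch (IOException e) {
            // Any other I/O failure ends the loop in this sketch.
        } finally {
            try {
                serverSocket.close();
            } catch (IOException ignored) {
                // Nothing useful to do on close failure in a sketch.
            }
        }
    }

    /** Starts the loop, lets it idle through a few timeouts, then stops it. */
    public static boolean demo() {
        try {
            AcceptLoopSketch server = new AcceptLoopSketch(50);
            Thread t = new Thread(server, "acceptLoop");
            t.start();
            Thread.sleep(150);
            server.stop();
            t.join(2000);
            return !t.isAlive() && server.serverSocket.isClosed();
        } catch (IOException | InterruptedException e) {
            return false;
        }
    }
}
```

A shorter timeout makes the loop notice the flag sooner, at the cost of more frequent wake-ups on an idle server.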

          Tsz Wo Nicholas Sze added a comment -

          Link this to HADOOP-3574: Better Datanode DiskOutOfSpaceException handling.

          Hairong Kuang added a comment -

          A new patch with a minor change to handle a failed test.

          Tsz Wo Nicholas Sze added a comment -

          Could you also include the cause e in the new DiskOutOfSpaceException in checkDiskError(...)?

          Raghu Angadi added a comment -

          After talking to Hairong:

          1. DataXceiverServer should handle SocketTimeoutException. Right now an idle DN prints an exception every 10 seconds.
          2. The timeout for the server socket could be lower, so that the test finishes faster.
          3. The unit test need not create files in a tight loop.
          4. immedateShutdown is not really necessary. The way shutdown() works, it should only be called from the offerService() thread. I think the JavaDoc should state it explicitly.
          5. The reason the log was printed in a tight infinite loop (without sleeping) is that the thread interrupts itself before calling sleep(), so sleep returns immediately!

          I think this should go into 0.18. No one likes disks filling up with these log messages.
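
The mechanism behind point 5 can be reproduced in isolation. The sketch below is not the DataNode code, and the class and method names are made up for illustration. It relies only on the documented behaviour of `Thread.sleep`: if the calling thread's interrupt flag is already set, `sleep` throws `InterruptedException` right away instead of waiting.

```java
public class InterruptedSleepSketch {
    /**
     * Sets the current thread's interrupt flag and then tries to sleep for one second.
     * Returns true if sleep() was cut short by InterruptedException almost immediately.
     */
    public static boolean sleepReturnsImmediately() {
        Thread.currentThread().interrupt();
        long start = System.nanoTime();
        boolean interrupted = false;
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // Throwing InterruptedException also clears the interrupt flag.
            interrupted = true;
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return interrupted && elapsedMillis < 500;
    }
}
```

A wait loop that re-interrupts its own thread on every iteration therefore never actually sleeps, and any log statement inside it runs at full speed.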

          Hairong Kuang added a comment -

          This patch incorporates Raghu's comments except for comment 3. The unit test does not create files in a tight loop. It waits until all replications are created before moving to the next iteration. I tried a few other ways of writing this test, and the current one seems to be the most efficient.

          In addition, I made a change to BlockReceiver. If the BlockReceiver constructor fails, it checks whether the failure was caused by a read-only disk. Since checking for read-only disks is an expensive operation, it is performed only when creating the temporary block file fails.
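
The "check only on failure" idea can be sketched as below. This is an illustration under assumptions, not the BlockReceiver change itself: `checkDisk`, `createTmpBlockFile`, and the counter are hypothetical names, and the disk check here is a cheap stand-in for the expensive one.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CreateThenCheckSketch {
    // Counts how often the (notionally expensive) disk check runs.
    static int diskChecks = 0;

    /** Hypothetical stand-in for an expensive disk check. */
    static void checkDisk(File dir) throws IOException {
        diskChecks++;
        if (!dir.isDirectory() || !dir.canWrite()) {
            throw new IOException("Directory is not writable: " + dir);
        }
    }

    /** Creates the file; the disk check runs only if creation fails. */
    static File createTmpBlockFile(File dir, String name) throws IOException {
        File f = new File(dir, name);
        try {
            if (!f.createNewFile()) {
                throw new IOException("File already exists: " + f);
            }
            return f;
        } catch (IOException e) {
            checkDisk(dir); // may throw a more specific disk error
            throw e;        // otherwise report the original failure
        }
    }

    /** Exercises one successful creation and one failing creation. */
    public static boolean demo() {
        try {
            File good = Files.createTempDirectory("blocks").toFile();
            int before = diskChecks;
            createTmpBlockFile(good, "blk_1.tmp");
            boolean noCheckOnSuccess = diskChecks == before;
            File missing = new File(good, "no-such-dir");
            try {
                createTmpBlockFile(missing, "blk_2.tmp");
                return false; // creation in a missing directory should fail
            } catch (IOException expected) {
                return noCheckOnSuccess && diskChecks == before + 1;
            }
        } catch (IOException e) {
            return false;
        }
    }
}
```

The successful path pays nothing extra; the check runs once, and only on the path where a disk problem is already suspected.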

          Raghu Angadi added a comment -
          1. writeToBlock() creates files in two places. The patch catches only one of them.
          2. There is an inherent requirement that shutdown() should only be called from the offerService thread. It would be better if the JavaDoc for shutdown() said this explicitly. Otherwise, this deadlock and the logging in a tight infinite loop could occur again with future changes.
          Hairong Kuang added a comment -

          I do not think it is necessary to check for a read-only disk for both the block file and the metadata file. Checking the block file is good enough. I will update the javadoc for shutdown.

          Hairong Kuang added a comment -

          ant test-core passed:
          BUILD SUCCESSFUL
          Total time: 118 minutes 28 seconds

          and so did ant patch:
          [exec] +1 overall.

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec] +1 tests included. The patch appears to include 4 new or modified tests.

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          Hairong Kuang added a comment -

          A patch for branch 0.18.

          Hairong Kuang added a comment -

          I just committed this.

          Hudson added a comment -

          Integrated in Hadoop-trunk #680 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/ )

            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Hairong Kuang