Hadoop HDFS / HDFS-15651

Client could not obtain block when DN CommandProcessingThread exits


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.1, 3.4.0, 3.2.3, 3.2.4
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      In our cluster, we applied the HDFS-14997 improvement.
      We found a case where CommandProcessingThread exits because of an OOM error. The OOM error was caused by an abnormal application of ours that was running on this DN node.

      2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit.
      java.lang.OutOfMemoryError: unable to create new native thread
              at java.lang.Thread.start0(Native Method)
              at java.lang.Thread.start(Thread.java:717)
              at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
              at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
              at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
              at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
      

      The main point is that a crashed CommandProcessingThread has a severe impact: none of the commands that the NN returns in its responses are processed on the DN side any more.
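The effect can be reproduced in isolation. The following is a minimal sketch using only the JDK, not DataNode code: the class name `DeadConsumerDemo` and the queued tasks are hypothetical, and an `IllegalStateException` stands in for the real `OutOfMemoryError`. It shows that when a single consumer thread wraps its whole loop in one try/catch, the first failure ends the thread and everything queued behind it is never run.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone demo; it only mimics the shape of the run() method quoted in this issue.
class DeadConsumerDemo {

  /** Returns {number of commands processed, number of commands left in the queue}. */
  static int[] runDemo() throws InterruptedException {
    BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    AtomicInteger processed = new AtomicInteger();

    queue.add(processed::incrementAndGet);                 // runs normally
    queue.add(() -> {                                      // simulated fatal failure
      throw new IllegalStateException("simulated failure");
    });
    queue.add(processed::incrementAndGet);                 // never runs
    queue.add(processed::incrementAndGet);                 // never runs

    Thread consumer = new Thread(() -> {
      try {
        while (true) {
          queue.take().run();
        }
      } catch (Throwable t) {
        // The catch sits outside the loop, so control leaves the loop
        // and the thread terminates after logging.
        System.err.println("consumer encountered fatal exception and exit: " + t);
      }
    }, "command-processor");

    consumer.start();
    consumer.join();  // returns once the consumer thread has died
    return new int[] {processed.get(), queue.size()};
  }

  public static void main(String[] args) throws InterruptedException {
    int[] result = runDemo();
    System.out.println("processed=" + result[0] + ", stillQueued=" + result[1]);
  }
}
```

In this sketch one command is processed and two stay in the queue forever, which is the same situation as a DNA_ACCESSKEYUPDATE command waiting behind a dead processing thread.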

      We have block tokens enabled for data access, but the DN command DNA_ACCESSKEYUPDATE was not processed in time by the DN. As a result, we saw many SASL errors in the DN log caused by key expiration:

      javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, userId=xxx, blockPoolId=xxxx, blockId=xxx, access modes=[READ]), since the required block key (keyID=xxx) doesn't exist.]
      

       

      On the client side, our users received many 'could not obtain block' errors with BlockMissingException.

      CommandProcessingThread is a critical thread; it should always be running.

        /**
         * CommandProcessingThread that process commands asynchronously.
         */
        class CommandProcessingThread extends Thread {
          private final BPServiceActor actor;
          private final BlockingQueue<Runnable> queue;
      
          ...
      
          @Override
          public void run() {
            try {
              processQueue();
            } catch (Throwable t) {
              LOG.error("{} encountered fatal exception and exit.", getName(), t);   // <=== should not exit this thread
            }
          }
      

      When an unexpected error happens, better handling would be to either:

      • catch the exception, deal with the error appropriately, and let processQueue continue to run
        or
      • exit the DN process so that an admin user can investigate
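The two options above can be sketched together as follows. This is an illustration under stated assumptions, not the change committed for this issue: the class `ResilientCommandProcessor`, its `enqueue` method, and the `fatalHandler` callback are hypothetical names, and the split between "recoverable" (`Exception`) and "fatal" (any other `Throwable`, such as `OutOfMemoryError`) is one possible policy among several.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch; names and error policy are assumptions, not Hadoop APIs.
class ResilientCommandProcessor extends Thread {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Consumer<Throwable> fatalHandler;

  ResilientCommandProcessor(Consumer<Throwable> fatalHandler) {
    super("resilient-command-processor");
    this.fatalHandler = fatalHandler;
  }

  void enqueue(Runnable command) {
    queue.add(command);
  }

  @Override
  public void run() {
    while (true) {
      try {
        queue.take().run();
      } catch (InterruptedException e) {
        // Treated as a shutdown request: restore the flag and stop.
        Thread.currentThread().interrupt();
        return;
      } catch (Exception e) {
        // Option 1: the try/catch is inside the loop, so the loop keeps running.
        System.err.println(getName() + " failed one command, continuing: " + e);
      } catch (Throwable t) {
        // Option 2: hand the error to a handler that is expected to stop the
        // whole process, so the failure is visible instead of silent.
        fatalHandler.accept(t);
        return;
      }
    }
  }

  /** Returns {commands processed, fatal errors reported}. */
  static int[] demo() throws InterruptedException {
    AtomicInteger processed = new AtomicInteger();
    AtomicInteger fatal = new AtomicInteger();
    // The demo only counts fatal errors; a real handler would terminate the process.
    ResilientCommandProcessor p =
        new ResilientCommandProcessor(t -> fatal.incrementAndGet());

    p.enqueue(processed::incrementAndGet);
    p.enqueue(() -> { throw new IllegalStateException("recoverable"); });
    p.enqueue(processed::incrementAndGet);   // still runs after the failure above
    p.enqueue(() -> { throw new OutOfMemoryError("simulated"); });
    p.enqueue(processed::incrementAndGet);   // not reached: fatal path was taken

    p.start();
    p.join();
    return new int[] {processed.get(), fatal.get()};
  }

  public static void main(String[] args) throws InterruptedException {
    int[] r = demo();
    System.out.println("processed=" + r[0] + ", fatal=" + r[1]);
  }
}
```

In a long-running daemon the handler passed to the constructor could, for example, call `System.exit` with a non-zero status. The key design point is that the thread never ends silently: it either keeps draining the queue or takes the whole process down with it.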

      Attachments

        1. HDFS-15651.001.patch
          3 kB
          Mingxiang Li
        2. HDFS-15651.002.patch
          3 kB
          Mingxiang Li
        3. HDFS-15651.patch
          3 kB
          Mingxiang Li


          People

            Assignee: Mingxiang Li (Aiphag0)
            Reporter: Yiqun Lin (linyiqun)
            Votes: 0
            Watchers: 5
