Description
In our cluster, we applied the HDFS-14997 improvement.
We found one case where the CommandProcessingThread exits due to an OOM error. The OOM was caused by an abnormal application of ours running on the same DN node.
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
The main point here is that a crashed CommandProcessingThread has a very bad impact: none of the commands the NN returns in heartbeat responses will be processed on the DN side anymore.
We have block tokens enabled for data access, but the DN command DNA_ACCESSKEYUPDATE was no longer processed in time by the DN. We then see lots of SASL errors in the DN log due to key expiration:
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, userId=xxx, blockPoolId=xxxx, blockId=xxx, access modes=[READ]), since the required block key (keyID=xxx) doesn't exist.]
On the client side, our users receive lots of 'could not obtain block' errors (BlockMissingException).
CommandProcessingThread is a critical thread; it should always be running.
/**
 * CommandProcessingThread that process commands asynchronously.
 */
class CommandProcessingThread extends Thread {
  private final BPServiceActor actor;
  private final BlockingQueue<Runnable> queue;
  ...
  @Override
  public void run() {
    try {
      processQueue();
    } catch (Throwable t) {
      LOG.error("{} encountered fatal exception and exit.",
          getName(), t);   // <=== should not exit this thread
    }
  }
Once an unexpected error happens, better handling would be to either:
- catch the exception, deal with the error appropriately, and let processQueue continue to run (see the sketch below), or
- exit the DN process so that an admin can investigate.
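For illustration, a minimal sketch of the first option. This is not the actual BPServiceActor code: the class name ResilientCommandProcessor, the queue field, the stopped flag, and the enqueue method are all hypothetical. The key idea is that each command runs inside its own try/catch, so one failing command (even an OOM while spawning an async deletion thread) does not kill the loop.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical stand-in for CommandProcessingThread: the processing
 * loop survives a per-command failure instead of letting the whole
 * thread die.
 */
class ResilientCommandProcessor extends Thread {
  private static final Logger LOG =
      Logger.getLogger(ResilientCommandProcessor.class.getName());
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;

  @Override
  public void run() {
    while (!stopped) {
      try {
        // Block until the next NN command arrives, then execute it.
        queue.take().run();
      } catch (InterruptedException e) {
        // Restore the interrupt flag and stop cleanly.
        Thread.currentThread().interrupt();
        stopped = true;
      } catch (Throwable t) {
        // Log the failure of this single command and keep the loop
        // alive, so later commands such as DNA_ACCESSKEYUPDATE are
        // still processed and block keys keep rolling.
        LOG.log(Level.SEVERE, getName() + " failed to process a command", t);
      }
    }
  }

  void enqueue(Runnable command) throws InterruptedException {
    queue.put(command);
  }
}

For the second option, the catch block could instead fail fast, e.g. via Hadoop's ExitUtil.terminate, so that a DN that can no longer process NN commands is taken down visibly instead of lingering in a degraded state.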
Attachments
Issue Links
- relates to HDFS-14997 BPServiceActor processes commands from NameNode asynchronously (Resolved)