Description
ThrottledAsyncChecker throws an NPE during block pool initialization. The error causes block pool registration to fail.
The exception
2019-05-20 01:02:36,003 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected exception in block pool Block pool <registering> (Datanode Uuid xxxxx) service to xx.xx.xx.xx/xx.xx.xx.xx
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$LastCheckResult.access$000(ThrottledAsyncChecker.java:211)
    at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker.schedule(ThrottledAsyncChecker.java:129)
    at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker.checkAllVolumes(DatasetVolumeChecker.java:209)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3387)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1508)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:272)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:768)
    at java.lang.Thread.run(Thread.java:745)
It looks like this error occurs because completedChecks, a WeakHashMap, removes the target entry while we are still about to read it. Although we check that the entry exists before we get it, there is still a chance that the get returns null.
We hit a corner case for this: in federation mode, with two block pools on the DN, ThrottledAsyncChecker schedules two identical health checks for the same volume.
2019-05-20 01:02:36,000 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /hadoop/2/hdfs/data/current
2019-05-20 01:02:36,000 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /hadoop/2/hdfs/data/current
completedChecks cleans up the entry after one of the two checks has successfully called completedChecks#get. When the other check then calls completedChecks#get, it receives null.
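A minimal sketch of how the lookup could be made null-safe, assuming the failing code first checks that the entry exists and then fetches it in a second call. The class SafeLookupSketch, the record LastResult, and the methods recordCompletion and shouldSchedule are hypothetical names used only for illustration; they are not the actual ThrottledAsyncChecker code. The idea is to read the map once and treat a null result as "no recent check", so a vanished entry can never be dereferenced.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class SafeLookupSketch {
    /** Hypothetical stand-in for the checker's per-target result record. */
    static final class LastResult {
        final long completedAt;

        LastResult(long completedAt) {
            this.completedAt = completedAt;
        }
    }

    // Entries in a WeakHashMap can disappear at any time once the key is
    // no longer strongly reachable, so a separate "exists?" check followed
    // by get() is not safe to rely on.
    private final Map<Object, LastResult> completedChecks = new WeakHashMap<>();
    private final long minGapMs;

    SafeLookupSketch(long minGapMs) {
        this.minGapMs = minGapMs;
    }

    /** Records that a check for the target completed at time now. */
    synchronized void recordCompletion(Object target, long now) {
        completedChecks.put(target, new LastResult(now));
    }

    /** Returns true if a new check for the target should be scheduled. */
    synchronized boolean shouldSchedule(Object target, long now) {
        // Single lookup: the value we test for null is the value we use.
        LastResult last = completedChecks.get(target);
        if (last == null) {
            return true; // no recorded result, or the entry was cleaned up
        }
        return now - last.completedAt >= minGapMs;
    }

    public static void main(String[] args) {
        SafeLookupSketch checker = new SafeLookupSketch(1000);
        Object volume = new Object();
        System.out.println(checker.shouldSchedule(volume, 0));
        checker.recordCompletion(volume, 0);
        System.out.println(checker.shouldSchedule(volume, 500));
    }
}
```

With this shape, two callers scheduling a check for the same volume at the same moment either both see the recorded result or both see null, and neither path dereferences a missing entry.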
Issue Links
- duplicates HDFS-14074 DataNode runs async disk checks maybe throws NullPointerException, and DataNode failed to register to NameSpace. (Resolved)