Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
Description
YARN-9264 accidentally introduced a bug in FpgaDiscoverer. Sometimes currentFpgaInfo is not set, resulting in an NPE being thrown:
2019-06-03 05:14:50,157 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaNodeResourceUpdateHandler.updateConfiguredResource(FpgaNodeResourceUpdateHandler.java:54) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.updateConfiguredResourcesViaPlugins(NodeStatusUpdaterImpl.java:358) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceInit(NodeStatusUpdaterImpl.java:190) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:459) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:869) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:942)
The problem is that in FpgaDiscoverer, we don't set currentFpgaInfo if the following condition is true:
if (allowed == null || allowed.equalsIgnoreCase( YarnConfiguration.AUTOMATICALLY_DISCOVER_GPU_DEVICES)) { return list; } else if (allowed.matches("(\\d,)*\\d")){ ...
Solution is simple: initialize it in both code-paths.
Unit tests should be enhanced to verify that it's set properly.