Details
-
Bug
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
2.7.5, 3.0.0
-
None
-
None
-
I am running SLS on a Linux server (Ubuntu 16.04). The Hadoop version is 3.0.0.
Description
Hi, I am using SLS to simulate a large-scale cluster that consists of more than 100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable (which periodically flushes real-time metrics to a file) stops working if SLS takes too long to load the SLS input file.
More specifically, the exception is thrown here, in the function String generateRealTimeTrackingMetrics() in SLSWebApp.java:
for (String queue : wrapper.getQueueSet()) { .......... }
The exception is reported as:
2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job: 6263127055conainer size: 10queue: default
java.lang.NullPointerException
at org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
So wrapper.getQueueSet() returns null, and iterating over it causes the exception.
After further analyzing the source code, we noticed the following in SLSRunner.java:
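The stack trace shows the runnable being driven by a ScheduledThreadPoolExecutor, which would explain why the metrics logging stops for good rather than failing once: the JDK documents that when an execution of a periodic task throws, subsequent executions are suppressed. The following is a minimal, standalone sketch of that behavior, not SLS code. The class name SuppressedTaskDemo and the local queueSet variable are hypothetical stand-ins for MetricsLogRunnable and the unset queue set.

```java
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SuppressedTaskDemo {

  /**
   * Schedules a periodic task that throws a NullPointerException on its
   * first run and returns how many times the task body was entered.
   */
  public static int runsBeforeSuppression() {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
    AtomicInteger runs = new AtomicInteger();

    ScheduledFuture<?> future = pool.scheduleAtFixedRate(() -> {
      runs.incrementAndGet();
      Set<String> queueSet = null;      // stand-in for a queue set that was never set
      for (String queue : queueSet) {   // throws NullPointerException
        System.out.println(queue);
      }
    }, 0, 10, TimeUnit.MILLISECONDS);

    try {
      // For a periodic task, get() only returns once the task has been
      // cancelled or has terminated with an exception.
      future.get();
    } catch (ExecutionException e) {
      // e.getCause() is the NullPointerException thrown by the task.
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
    return runs.get();
  }

  public static void main(String[] args) {
    System.out.println("task body entered " + runsBeforeSuppression() + " time(s)");
  }
}
```

Because the executor never reschedules the task after the first failure, a periodic runnable that hits this NullPointerException once will not run again, even after the queue set is eventually populated.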
public void start() throws Exception {
  // start resource manager
  startRM();
  // start node managers
  startNM();
  // start application masters
  startAM();
  // set queue & tracked apps information
  ((SchedulerWrapper) rm.getResourceScheduler())
      .setQueueSet(this.queueAppNumMap.keySet());
  ((SchedulerWrapper) rm.getResourceScheduler())
      .setTrackedAppSet(this.trackedApps);
  // print out simulation info
  printSimulationInfo();
  // blocked until all nodes RUNNING
  waitForNodesRunning();
  // starting the runner once everything is ready to go,
  runner.start();
}
As you can see, the queue set for tracking is set by ((SchedulerWrapper) rm.getResourceScheduler()).setQueueSet(this.queueAppNumMap.keySet());, which happens only after the RM, NM, and application master initialization. The MetricsLogRunnable has already been launched before the queue set is set. That is why the queue set is still null when the runnable fires, which causes the NullPointerException.
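One possible way to make the metrics code tolerate this ordering is to treat an unset queue set as empty until setQueueSet is called. The sketch below illustrates the idea under stated assumptions and is not the actual SLS fix: QueueSetGuardDemo and generateMetrics are hypothetical names, and the real SchedulerWrapper and SLSWebApp classes may be structured differently.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

public class QueueSetGuardDemo {

  // null until setQueueSet is called; volatile because the setter and the
  // periodic metrics thread run on different threads.
  private volatile Set<String> queueSet;

  public void setQueueSet(Set<String> queues) {
    this.queueSet = queues;
  }

  /** Never returns null: an unset queue set is reported as empty. */
  public Set<String> getQueueSet() {
    Set<String> current = queueSet;
    return current == null ? Collections.<String>emptySet() : current;
  }

  /** Hypothetical stand-in for generateRealTimeTrackingMetrics(). */
  public String generateMetrics() {
    StringBuilder sb = new StringBuilder("{");
    // Sorted copy so the output order is deterministic.
    for (String queue : new TreeSet<>(getQueueSet())) {
      if (sb.length() > 1) {
        sb.append(',');
      }
      sb.append('"').append(queue).append('"');
    }
    return sb.append('}').toString();
  }
}
```

With this guard, a metrics run that happens before setQueueSet simply reports no queues instead of throwing. An alternative with the same effect would be to call setQueueSet before any component that starts the metrics thread, at the cost of reordering start().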
Issue Links
- duplicates YARN-8632 Threads in SLS quit without logging exception (Resolved)