  1. Hadoop YARN
  2. YARN-7964

YARN Scheduler Load Simulator (SLS): MetricsLogRunnable stops working when there are too many jobs to load from the SLS file


Details

    • Bug
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • 2.7.5, 3.0.0
    • None
    • None
    • I am running SLS on a Linux server (Ubuntu 16.04). The Hadoop version is 3.0.0.

    Description

      Hi, I am using SLS to simulate a large-scale cluster that consists of more than 100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable (which periodically flushes real-time metrics to a file) stops working if SLS takes too long to load the SLS file.

      More specifically, the exception is thrown here, in the function String generateRealTimeTrackingMetrics() in SLSWebApp.java:

      for (String queue : wrapper.getQueueSet()) {
      ..........
      }
      

       
      The exception is reported as:
      2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job: 6263127055conainer size: 10queue: default
      java.lang.NullPointerException
      at org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
      at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

       

      So wrapper.getQueueSet() returns null, and iterating over that null value in the for loop throws the NullPointerException.

      After further analyzing the source code, we noticed the following in SLSRunner.java:
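The stack trace passes through FutureTask.runAndReset, which suggests the runnable is scheduled periodically. For tasks scheduled with scheduleAtFixedRate, the JDK documents that an exception in one execution suppresses all subsequent executions, which would explain why the metrics logging stops for good rather than recovering once the queue set is available. The sketch below illustrates this under stated assumptions: the class name, the AtomicReference holder, and the timings are hypothetical stand-ins, not SLS code.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class SuppressedMetricsTaskDemo {

    /** Returns how many times the periodic task ran in total. */
    static int countRuns() {
        // Hypothetical stand-in for the wrapper's queue set: null until it is set.
        AtomicReference<Set<String>> queueSet = new AtomicReference<>(null);
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        try {
            ScheduledFuture<?> handle = pool.scheduleAtFixedRate(() -> {
                runs.incrementAndGet();
                // The for-each loop calls iterator() on null and throws NullPointerException.
                for (String queue : queueSet.get()) {
                    System.out.println("metrics for queue " + queue);
                }
            }, 0, 10, TimeUnit.MILLISECONDS);

            try {
                // For a periodic task, get() only returns by throwing.
                handle.get(5, TimeUnit.SECONDS);
            } catch (ExecutionException e) {
                System.out.println("periodic task failed with: " + e.getCause());
            } catch (TimeoutException e) {
                throw new IllegalStateException("task did not fail as expected", e);
            }

            // Setting the queue set now is too late: the task is never rescheduled.
            queueSet.set(Collections.singleton("default"));
            Thread.sleep(100);
            return runs.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("total runs: " + countRuns());
    }
}
```

In this sketch the task runs exactly once, fails, and is never run again, even though the queue set becomes valid shortly afterwards.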

      public void start() throws Exception {
          // start resource manager
          startRM();
          // start node managers
          startNM();
          // start application masters
          startAM();
          // set queue & tracked apps information
          ((SchedulerWrapper) rm.getResourceScheduler())
                                  .setQueueSet(this.queueAppNumMap.keySet());
          ((SchedulerWrapper) rm.getResourceScheduler())
                                  .setTrackedAppSet(this.trackedApps);
          // print out simulation info
          printSimulationInfo();
          // blocked until all nodes RUNNING
          waitForNodesRunning();
          // starting the runner once everything is ready to go,
          runner.start();
        }
      

      As you can see, the queue set for tracking is set by ((SchedulerWrapper) rm.getResourceScheduler()).setQueueSet(this.queueAppNumMap.keySet()), which runs only after the RM, NM, and AM initialization. By that time, MetricsLogRunnable has already been launched. That is why the queue set is still null when the runnable first reads it, which causes the NullPointerException.
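One possible mitigation, sketched below, is to make the queue set null-safe so that a reader that runs before setQueueSet() sees an empty set instead of null. This is only an illustration of the idea, not the project's actual fix: QueueSetHolder and generateMetrics() are hypothetical names, and the metrics format is invented for the example. Another option would be to call setQueueSet() before anything that launches the metrics thread.

```java
import java.util.Collections;
import java.util.Set;

final class QueueSetHolder {
    // Start with an empty set rather than null, so early readers see "no queues yet".
    private volatile Set<String> queueSet = Collections.emptySet();

    void setQueueSet(Set<String> queues) {
        // Treat a null argument as "no queues" to keep the getter null-free.
        this.queueSet = (queues == null) ? Collections.<String>emptySet() : queues;
    }

    Set<String> getQueueSet() {
        return queueSet;
    }

    /** Hypothetical, simplified stand-in for a metrics line: one entry per queue. */
    String generateMetrics() {
        StringBuilder sb = new StringBuilder();
        for (String queue : getQueueSet()) { // safe even before setQueueSet() is called
            sb.append(queue).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        QueueSetHolder holder = new QueueSetHolder();
        System.out.println("before set: '" + holder.generateMetrics() + "'");
        holder.setQueueSet(Collections.singleton("default"));
        System.out.println("after set:  '" + holder.generateMetrics() + "'");
    }
}
```

With this shape, an early metrics run produces an empty record rather than an exception, so a periodic task would keep running and pick up the queues once they are set.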
       

            People

              Assignee: Unassigned
              Reporter: Wei Chen (cxcw)