Spark / SPARK-9503

Mesos dispatcher NullPointerException (MesosClusterScheduler)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: Mesos
    • Environment: branch-1.4 #8dfdca46dd2f527bf653ea96777b23652bc4eb83

    Description

      Hello,

      I have just started using start-mesos-dispatcher and have noticed that it randomly crashes with NullPointerExceptions (NPEs).

      Judging by the exception, it looks like in certain situations "queuedDrivers" is empty, which causes the NPE at "submission.cores":

      https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L512-L516
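
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use Spark's real classes: `Submission`, `totalCoresUnsafe`, and `totalCoresSafe` are hypothetical names introduced only for illustration. It shows that `foreach` on an empty `ArrayBuffer` never invokes the closure, whereas a null element dereferenced inside the closure throws an NPE, and that a simple null filter avoids the crash. Whether a null entry is what actually happens in the dispatcher is an assumption, not something the trace confirms.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Try

// Hypothetical stand-in for a queued driver submission; not Spark's real class.
final case class Submission(id: String, cores: Double)

object QueuedDriversSketch {
  // Unguarded: dereferences every element, so a null entry throws an NPE.
  def totalCoresUnsafe(queued: ArrayBuffer[Submission]): Double = {
    var total = 0.0
    queued.foreach { submission => total += submission.cores }
    total
  }

  // Guarded: null entries are filtered out before any field is read.
  def totalCoresSafe(queued: ArrayBuffer[Submission]): Double =
    queued.iterator.filter(_ != null).map(_.cores).sum

  def main(args: Array[String]): Unit = {
    val empty = ArrayBuffer.empty[Submission]
    val withNull: ArrayBuffer[Submission] =
      ArrayBuffer(Submission("driver-1", 2.0), null)

    // Empty buffer: the closure is never invoked, so nothing is dereferenced.
    println(totalCoresUnsafe(empty))
    // Null entry: the NPE is captured by Try instead of killing the thread.
    println(Try(totalCoresUnsafe(withNull)).isFailure)
    // Guarded version skips the null entry and sums the remaining cores.
    println(totalCoresSafe(withNull))
  }
}
```

A guard like this only hides the symptom. If null entries do appear in the queue, the underlying cause (for example, the buffer being modified while it is iterated) would still need to be found.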

      log
      15/07/30 23:56:44 INFO MesosRestServer: Started REST server for submitting applications on port 7077
      Exception in thread "Thread-1647" java.lang.NullPointerException
              at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:437)
              at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:436)
              at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
              at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
              at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:436)
              at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:512)
      I0731 00:53:52.969518  7014 sched.cpp:1625] Asked to abort the driver
      I0731 00:53:52.969895  7014 sched.cpp:861] Aborting framework '20150730-234528-4261456064-5050-61754-0000'
      15/07/31 00:53:52 INFO MesosClusterScheduler: driver.run() returned with code DRIVER_ABORTED
      

      A side effect of this NPE is that after the crash the dispatcher will not start again, because its framework is already registered (see SPARK-7831):

      log
      15/07/31 09:55:47 INFO MesosClusterUI: Started MesosClusterUI at http://192.168.0.254:8081
      I0731 09:55:47.715039  8162 sched.cpp:157] Version: 0.23.0
      I0731 09:55:47.717013  8163 sched.cpp:254] New master detected at master@192.168.0.254:5050
      I0731 09:55:47.717381  8163 sched.cpp:264] No credentials provided. Attempting to register without authentication
      I0731 09:55:47.718246  8177 sched.cpp:819] Got error 'Completed framework attempted to re-register'
      I0731 09:55:47.718268  8177 sched.cpp:1625] Asked to abort the driver
      15/07/31 09:55:47 ERROR MesosClusterScheduler: Error received: Completed framework attempted to re-register
      I0731 09:55:47.719091  8177 sched.cpp:861] Aborting framework '20150730-234528-4261456064-5050-61754-0038'
      15/07/31 09:55:47 INFO MesosClusterScheduler: driver.run() returned with code DRIVER_ABORTED
      15/07/31 09:55:47 INFO Utils: Shutdown hook called
      

      I can work around this by removing the dispatcher's ZooKeeper data:

      zkCli.sh
      rmr /spark_mesos_dispatcher
      

          People

            Assignee: Unassigned
            Reporter: Sebastian YEPES FERNANDEZ (syepes)
            Votes: 1
            Watchers: 3
