Spark / SPARK-4006

Spark Driver crashes whenever an Executor is registered twice

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.2, 1.0.2, 1.1.0, 1.2.0
    • Fix Version/s: 0.9.3, 1.0.3, 1.1.1, 1.2.0
    • Component/s: Block Manager, Spark Core
    • Labels: None
    • Environment: Mesos, Coarse Grained

    Description

      This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.

      We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, so the driver still holds the old registration; when the new executor registers, the driver crashes. In Mesos coarse-grained mode executor IDs are fixed, so the new executor registers under the same ID as the one that was lost.

      The issue is the System.exit(1) call in BlockManagerMasterActor:

      private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
        if (!blockManagerInfo.contains(id)) {
          blockManagerIdByExecutor.get(id.executorId) match {
            case Some(manager) =>
              // A block manager of the same executor already exists.
              // This should never happen. Let's just quit.
              logError("Got two different block manager registrations on " + id.executorId)
              System.exit(1)
            case None =>
              blockManagerIdByExecutor(id.executorId) = id
          }

          logInfo("Registering block manager %s with %s RAM".format(
            id.hostPort, Utils.bytesToString(maxMemSize)))

          blockManagerInfo(id) =
            new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
        }
        listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
      }
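
One way to avoid the crash is to treat a second registration for the same executor ID as a replacement: drop the stale entry and accept the new one. The following is a minimal, self-contained sketch of that idea, not the actual Spark code or the committed fix. `BlockManagerRegistry`, its simplified `BlockManagerId` and `BlockManagerInfo` case classes, and the `currentIdFor` and `size` helpers are hypothetical stand-ins introduced only for illustration; the actor reference and listener bus are left out.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for Spark's types (not the real classes).
final case class BlockManagerId(executorId: String, host: String, port: Int)
final case class BlockManagerInfo(id: BlockManagerId, registeredAtMs: Long, maxMemSize: Long)

// Hypothetical registry that models only the two maps used by register().
final class BlockManagerRegistry {
  private val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, BlockManagerInfo]
  private val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  /** Registers a block manager and returns the stale id that was evicted, if any. */
  def register(id: BlockManagerId, maxMemSize: Long): Option[BlockManagerId] = {
    if (blockManagerInfo.contains(id)) {
      None // the very same block manager registered again: nothing to change
    } else {
      // A different block manager may still be recorded for this executor,
      // for example because its removal message never arrived.
      val stale = blockManagerIdByExecutor.get(id.executorId)
      stale.foreach { oldId =>
        Console.err.println(s"Replacing stale block manager $oldId with $id")
        blockManagerInfo.remove(oldId) // evict instead of calling System.exit(1)
      }
      blockManagerIdByExecutor(id.executorId) = id
      blockManagerInfo(id) = BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize)
      stale
    }
  }

  def currentIdFor(executorId: String): Option[BlockManagerId] =
    blockManagerIdByExecutor.get(executorId)

  def size: Int = blockManagerInfo.size
}

object RegistryDemo {
  def main(args: Array[String]): Unit = {
    val registry = new BlockManagerRegistry
    val lost = BlockManagerId("exec-1", "host-a", 7001)
    val replacement = BlockManagerId("exec-1", "host-a", 7002)
    registry.register(lost, 1024L)
    // Same executor id, different block manager: the old entry is evicted.
    val evicted = registry.register(replacement, 1024L)
    println(s"evicted=$evicted current=${registry.currentIdFor("exec-1")}")
  }
}
```

In this sketch the driver-side state stays consistent after a missed removal: the map keeps exactly one entry per executor ID, and the caller learns which registration was evicted so it can clean up any blocks attributed to it.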
      

          People

            Assignee: Tal Sliwowicz (tsliwowicz)
            Reporter: Tal Sliwowicz (tsliwowicz)