  Spark / SPARK-17468

Cluster workers crash when the master's network is down for more than one WORKER_TIMEOUT_MS


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment:

      CentOS 6.5, Spark standalone cluster: 15 machines, 15 workers and 2 masters; a worker, a master, and the driver run on the same machine

    • Flags:
      Important

      Description

      I'm from China. My production Spark standalone cluster crashed during the 9.9 sales event. Please help me figure out how to solve this problem. Thanks.

      master log is below:

      16/09/09 09:49:57 WARN Master: Removing worker-20160814124907-10.205.130.37-16590 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814113016-10.205.130.13-57487 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814134926-10.205.130.39-11430 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814131257-10.205.130.38-32160 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814161444-10.205.136.19-14196 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814141654-10.205.130.42-49707 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814115125-10.205.130.14-38381 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814152146-10.205.136.10-24730 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814122817-10.205.130.36-54348 because we got no heartbeat in 60 seconds
      16/09/09 09:49:57 WARN Master: Removing worker-20160814170452-10.205.136.34-9921 because we got no heartbeat in 60 seconds
      16/09/09 09:49:58 WARN Master: Removing worker-20160814154744-10.205.136.12-12399 because we got no heartbeat in 60 seconds
      16/09/09 09:49:58 WARN Master: Removing worker-20160814150355-10.205.130.44-5792 because we got no heartbeat in 60 seconds
      16/09/09 09:49:58 WARN Master: Removing worker-20160814143901-10.205.130.43-2223 because we got no heartbeat in 60 seconds
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814141654-10.205.130.42-49707. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814115125-10.205.130.14-38381. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814134926-10.205.130.39-11430. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814131257-10.205.130.38-32160. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814150355-10.205.130.44-5792. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814154744-10.205.136.12-12399. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814161444-10.205.136.19-14196. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814113016-10.205.130.13-57487. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814152146-10.205.136.10-24730. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814143901-10.205.130.43-2223. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814122817-10.205.130.36-54348. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
      16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.

      I think the code below may be wrong. When the master's network is down for more than WORKER_TIMEOUT_MS, the master removes the worker and executor information from its memory. But when the workers quickly reconnect to the master, their old state has already been erased on the master side, so even though they are still running the old executors, the master allocates more resources than the workers can afford, and that crashes my workers.
      As a workaround I tried increasing WORKER_TIMEOUT_MS to 3 minutes. Is that OK? Can you give me some advice?
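For reference, the master's WORKER_TIMEOUT_MS is driven by the `spark.worker.timeout` setting (in seconds, default 60). A minimal sketch of raising it to 3 minutes, assuming the standard standalone conf layout; the master process must be restarted to pick this up:

```
# conf/spark-defaults.conf on the master
# (or pass as SPARK_MASTER_OPTS="-Dspark.worker.timeout=180" in spark-env.sh)
# Raise the standalone master's worker-heartbeat timeout from 60 s to 180 s.
spark.worker.timeout   180
```

Note this only delays the removal; if the network outage outlasts the new timeout, the same remove-then-re-register race described above can still occur.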

      Code location:
      org.apache.spark.deploy.master.Master, line 1023

      /** Check for, and remove, any timed-out workers */
      private def timeOutDeadWorkers() {
        // Copy the workers into an array so we don't modify the hashset while iterating through it
        val currentTime = System.currentTimeMillis()
        val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
        for (worker <- toRemove) {
          if (worker.state != WorkerState.DEAD) {
            logWarning("Removing %s because we got no heartbeat in %d seconds".format(
              worker.id, WORKER_TIMEOUT_MS / 1000))
            removeWorker(worker)
          } else {
            if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
              workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
            }
          }
        }
      }
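The timeout check above boils down to two comparisons against `lastHeartbeat`. A minimal, self-contained sketch of just that predicate logic, using a hypothetical `WorkerInfo` stand-in rather than Spark's actual class:

```scala
object TimeoutSketch {
  // Hypothetical stand-in for Spark's internal worker record, for illustration only.
  case class WorkerInfo(id: String, lastHeartbeat: Long, dead: Boolean)

  val WORKER_TIMEOUT_MS = 60 * 1000L // spark.worker.timeout default: 60 s
  val REAPER_ITERATIONS = 15         // assumed value; controls how long DEAD workers linger

  // A live worker is removed after one WORKER_TIMEOUT_MS without a heartbeat.
  def timedOut(w: WorkerInfo, now: Long): Boolean =
    w.lastHeartbeat < now - WORKER_TIMEOUT_MS

  // A DEAD worker is only culled after (REAPER_ITERATIONS + 1) timeout periods.
  def shouldCull(w: WorkerInfo, now: Long): Boolean =
    w.dead && w.lastHeartbeat < now - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)

  def main(args: Array[String]): Unit = {
    val now = 1000000L
    // Missed heartbeat for 61 s: removed on the next check.
    println(timedOut(WorkerInfo("w1", now - 61000, dead = false), now)) // true
    // Last heartbeat 30 s ago: still considered alive.
    println(timedOut(WorkerInfo("w2", now - 30000, dead = false), now)) // false
  }
}
```

The asymmetry is the point of this report: a single missed timeout window erases the worker's state on the master, while the worker itself keeps its executors running and simply re-registers afterwards.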


            People

            • Assignee:
              Unassigned
            • Reporter:
              693946948@qq.com zhangzhiyan
            • Votes:
              0
            • Watchers:
              1


                Time Tracking

                • Estimated: 168h
                • Remaining: 168h
                • Logged: Not Specified