Spark / SPARK-15606

Driver hang in o.a.s.DistributedSuite on 2 core machine

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: 1.6.3, 2.0.0
    • Component/s: Spark Core
    • Labels: None
    • Environment: AMD64 box with only 2 cores

    Description

      repeatedly failing task that crashes JVM *** FAILED ***
      The code passed to failAfter did not complete within 100000 milliseconds. (DistributedSuite.scala:128)

      This test started failing, and DistributedSuite started hanging, after https://github.com/apache/spark/pull/13055

      It looks like the extra message to remove the BlockManager causes a deadlock, because there are only 2 message-processing loop threads. Related to https://issues.apache.org/jira/browse/SPARK-13906. The dispatcher sizes its thread pool like this:

        /** Thread pool used for dispatching messages. */
        private val threadpool: ThreadPoolExecutor = {
          val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
            math.max(2, Runtime.getRuntime.availableProcessors()))
          val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
          for (i <- 0 until numThreads) {
            pool.execute(new MessageLoop)
          }
          pool
        }
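The suspected starvation can be reproduced outside Spark with a plain fixed-size pool. The sketch below is only an analogy, not Spark's dispatcher code: `DispatcherStarvationSketch` and `followUpRuns` are hypothetical names, and it assumes the hang comes from every loop thread blocking on work that needs one more free thread.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object DispatcherStarvationSketch {

  /**
   * Starts `blockingHandlers` tasks that each block until a follow-up task
   * has run, then submits that follow-up to the same fixed-size pool.
   * Returns true if the follow-up ran within `timeoutMs`.
   */
  def followUpRuns(poolSize: Int, blockingHandlers: Int, timeoutMs: Long): Boolean = {
    val pool = Executors.newFixedThreadPool(poolSize)
    // Only min(poolSize, blockingHandlers) handlers can be running at once.
    val handlersStarted = new CountDownLatch(math.min(poolSize, blockingHandlers))
    val followUpDone = new CountDownLatch(1)
    try {
      for (_ <- 0 until blockingHandlers) {
        pool.execute(new Runnable {
          override def run(): Unit = {
            handlersStarted.countDown()
            try {
              // Holds this pool thread until the follow-up has been processed.
              followUpDone.await()
            } catch {
              case _: InterruptedException => () // released by shutdownNow()
            }
          }
        })
      }
      handlersStarted.await()
      // The follow-up needs a free pool thread; with none left it stays queued.
      pool.execute(new Runnable {
        override def run(): Unit = followUpDone.countDown()
      })
      followUpDone.await(timeoutMs, TimeUnit.MILLISECONDS)
    } finally {
      pool.shutdownNow() // interrupts any handlers that are still blocked
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"2 threads, 2 blocked handlers: follow-up ran = ${followUpRuns(2, 2, 500)}")
    println(s"3 threads, 2 blocked handlers: follow-up ran = ${followUpRuns(3, 2, 500)}")
  }
}
```

With two pool threads and two blocked handlers the follow-up never gets a thread, so the call times out and returns false. With a third thread it completes and returns true. That matches the observation that a larger minimum hides the hang without removing the blocking itself.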
      
      

      Setting a minimum of 3 threads alleviates this issue, but I'm not sure there isn't another underlying problem.
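As a stopgap on a 2-core machine, the same effect can probably be had without patching Spark by overriding the key that the excerpt above reads. This is a minimal sketch, assuming the driver's `SparkConf` is the one the dispatcher consults; `DispatcherThreadsWorkaround` is a hypothetical name and the value 3 comes from the observation above, not from a confirmed fix.

```scala
import org.apache.spark.SparkConf

object DispatcherThreadsWorkaround {
  // Overrides the default of max(2, availableProcessors) shown in the excerpt.
  // Assumption: 3 threads is enough to avoid the hang described in this issue.
  val conf: SparkConf = new SparkConf()
    .set("spark.rpc.netty.dispatcher.numThreads", "3")
}
```

This only works around the symptom. If a handler blocks a loop thread while waiting on another message, a larger pool may just make the hang less likely.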

          People

            Assignee: robbinspg Peter George Robbins
            Reporter: robbinspg Peter George Robbins