SPARK-15606: Driver hang in o.a.s.DistributedSuite on 2 core machine


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: 1.6.3, 2.0.0
    • Component/s: Spark Core
    • Labels: None
    • Environment: AMD64 box with only 2 cores

    Description

      The test "repeatedly failing task that crashes JVM" in DistributedSuite fails with:

        repeatedly failing task that crashes JVM *** FAILED ***
        The code passed to failAfter did not complete within 100000 milliseconds. (DistributedSuite.scala:128)

      This test started failing, and DistributedSuite started hanging, following https://github.com/apache/spark/pull/13055

      It looks like the extra message sent to remove the BlockManager deadlocks the dispatcher, since on this machine there are only 2 message-processing loop threads (a minimal sketch of this failure mode follows the snippet below). Related to https://issues.apache.org/jira/browse/SPARK-13906. The dispatcher sizes its thread pool as follows:

        /** Thread pool used for dispatching messages. */
        private val threadpool: ThreadPoolExecutor = {
          val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
            math.max(2, Runtime.getRuntime.availableProcessors()))
          val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
          for (i <- 0 until numThreads) {
            pool.execute(new MessageLoop)
          }
          pool
        }
      
      
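      For illustration, here is a minimal, self-contained sketch of the suspected failure mode (plain JDK executors, not Spark's actual RPC code): every thread in a fixed-size pool blocks waiting on work that can only be executed by that same pool, so nothing ever makes progress.

        import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

        // Sketch only: a 2-thread pool standing in for the 2 dispatcher loops.
        object DispatcherDeadlockSketch {
          def main(args: Array[String]): Unit = {
            val pool  = Executors.newFixedThreadPool(2)
            val reply = new CountDownLatch(2)

            for (_ <- 1 to 2) {
              pool.execute { () =>
                // Queue the "reply" on the same pool, then block waiting for it.
                // Both threads end up parked here, so the queued replies never run.
                pool.execute(() => reply.countDown())
                reply.await()
              }
            }

            // Observe the hang without wedging this sketch itself.
            println(s"replies delivered in time: ${reply.await(2, TimeUnit.SECONDS)}") // false
            pool.shutdownNow()
          }
        }

      With a third thread available, the queued reply tasks would run, release the latch, and unblock both loops.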

      Setting a minimum of 3 threads alleviates the issue, but I'm not sure there isn't another underlying problem. A sketch of that change is below.
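
      Assuming the workaround is simply raising the floor in the snippet quoted above (the actual patch may differ), the change would look like this:

        /** Thread pool used for dispatching messages. */
        private val threadpool: ThreadPoolExecutor = {
          // Hypothetical workaround: a floor of 3 instead of 2, so that two
          // mutually blocked message loops still leave one thread free.
          val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
            math.max(3, Runtime.getRuntime.availableProcessors()))
          val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
          for (i <- 0 until numThreads) {
            pool.execute(new MessageLoop)
          }
          pool
        }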

    People

    • Assignee: robbinspg (Peter George Robbins)
    • Reporter: robbinspg (Peter George Robbins)
    • Votes: 0
    • Watchers: 3

              Dates

              • Created:
                Updated:
                Resolved: