[SPARK-15606] Driver hang in o.a.s.DistributedSuite on 2 core machine - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.2, 2.0.0
Fix Version/s: 1.6.3, 2.0.0
Component/s: Spark Core
Labels: None
Environment: AMD64 box with only 2 cores

Description

repeatedly failing task that crashes JVM *** FAILED ***
The code passed to failAfter did not complete within 100000 milliseconds. (DistributedSuite.scala:128)

This test started failing, with DistributedSuite hanging, following https://github.com/apache/spark/pull/13055

It looks like handling the extra message that removes the BlockManager deadlocks, because on a 2-core machine there are only 2 message-processing loop threads (see the dispatcher thread pool below). Related to https://issues.apache.org/jira/browse/SPARK-13906

  /** Thread pool used for dispatching messages. */
  private val threadpool: ThreadPoolExecutor = {
    val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
      math.max(2, Runtime.getRuntime.availableProcessors()))
    val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
    for (i <- 0 until numThreads) {
      pool.execute(new MessageLoop)
    }
    pool
  }

Setting a minimum of 3 threads alleviates this issue, but I'm not sure there isn't another underlying problem.
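
To illustrate the suspected failure mode, here is a minimal sketch of how a fixed-size pool can stall when every thread blocks on work that needs one more pool thread. It is not Spark's dispatcher code and does not reproduce the actual BlockManager message flow; the object `DispatcherStarvationSketch`, the helper `allHandlersFinish`, and the choice of 2 blocking handlers are all assumptions made for illustration.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

// Hypothetical illustration only; none of these names exist in Spark.
object DispatcherStarvationSketch {

  /**
   * Starts `handlers` tasks that each occupy a pool thread, submit a
   * follow-up task to the same pool, and block until it completes.
   * Returns true if every handler finished within the timeout.
   */
  def allHandlersFinish(numThreads: Int, handlers: Int = 2, timeoutMs: Long = 500L): Boolean = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val allRunning = new CountDownLatch(handlers)
    val finished = new CountDownLatch(handlers)
    try {
      for (_ <- 0 until handlers) {
        pool.execute(new Runnable {
          override def run(): Unit = {
            allRunning.countDown()
            // Wait until every handler holds a thread, so the outcome
            // does not depend on scheduling order.
            if (allRunning.await(timeoutMs, TimeUnit.MILLISECONDS)) {
              val followUp = pool.submit(new Runnable {
                override def run(): Unit = ()
              })
              try {
                // Blocks this pool thread until another pool thread runs followUp.
                followUp.get(timeoutMs, TimeUnit.MILLISECONDS)
                finished.countDown()
              } catch {
                case _: Exception => () // timed out or interrupted: handler gives up
              }
            }
          }
        })
      }
      finished.await(2 * timeoutMs, TimeUnit.MILLISECONDS)
    } finally {
      pool.shutdownNow()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"2 threads: ${allHandlersFinish(numThreads = 2)}")
    println(s"3 threads: ${allHandlersFinish(numThreads = 3)}")
  }
}
```

With 2 threads, both handlers hold the only threads while waiting, so the follow-up tasks cannot run until the handlers time out. With 3 threads, the spare thread runs the follow-ups and both handlers finish. This matches the observation that a third thread alleviates the hang, but it says nothing about whether a different underlying problem exists.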

Issue Links

links to

[Github] Pull Request #13355 (robbinspg)

People

Assignee: Peter George Robbins
Reporter: Peter George Robbins
Votes: 0
Watchers: 3

Dates

Created: 27/May/16 08:39
Updated: 23/Jun/16 21:43
Resolved: 02/Jun/16 17:15