[SPARK-3139] Akka timeouts from ContextCleaner when cleaning shuffles - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.1.0
Component/s: None
Labels: None
Environment: 10 r3.2xlarge tests on EC2, running the scala-agg-by-key-int spark-perf test against master commit d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.
Target Version/s: 1.1.0

Description

When running spark-perf tests on EC2, I have a job that's consistently logging the following Akka exceptions:

14/08/19 22:07:12 ERROR spark.ContextCleaner: Error cleaning shuffle 0
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at org.apache.spark.storage.BlockManagerMaster.removeShuffle(BlockManagerMaster.scala:118)
  at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:159)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:131)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:124)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:124)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1252)
  at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:119)
  at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)

and

14/08/19 22:07:12 ERROR storage.BlockManagerMaster: Failed to remove shuffle 0
akka.pattern.AskTimeoutException: Timed out
  at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
  at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
  at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
  at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
  at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
  at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
  at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
  at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
  at java.lang.Thread.run(Thread.java:745)

This doesn't seem to prevent the job from completing successfully, but it's a serious issue because it means that resources aren't being cleaned up. The test script, ScalaAggByKeyInt, runs each test 10 times, and I see the same error after each test, so this seems to be deterministically reproducible.

I'll look at the executor logs to see if I can find more info there.
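
For readers unfamiliar with the failure mode in the first trace, here is a minimal sketch of a blocking wait on a reply that never arrives. It is not Spark's code: `removeShuffleBlocking` and `cleanShuffle` are hypothetical stand-ins, and the 100 ms timeout is chosen only to keep the example fast. It shows why the job can still finish: the timeout is caught and logged by the caller, and the cleanup is simply skipped.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object CleanupTimeoutSketch {
  // Hypothetical stand-in for a blocking remove-shuffle call: it waits on a
  // reply that is never delivered, simulating an ask that goes unanswered.
  def removeShuffleBlocking(shuffleId: Int, timeout: FiniteDuration): Boolean = {
    val reply = Promise[Boolean]() // never completed in this sketch
    Await.result(reply.future, timeout) // throws TimeoutException on expiry
  }

  // Hypothetical cleaner step: logs the timeout instead of propagating it,
  // so the caller keeps running but the shuffle is left uncleaned.
  def cleanShuffle(shuffleId: Int, timeout: FiniteDuration): Boolean =
    try removeShuffleBlocking(shuffleId, timeout)
    catch {
      case e: TimeoutException =>
        Console.err.println(s"Error cleaning shuffle $shuffleId: ${e.getMessage}")
        false
    }

  def main(args: Array[String]): Unit = {
    val cleaned = cleanShuffle(0, 100.millis)
    println(s"shuffle 0 cleaned: $cleaned")
  }
}
```

In this sketch `cleanShuffle` returns `false` after the timeout, which mirrors the symptom reported here: no job failure, but the shuffle's resources are not released.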

Issue Links

is related to: SPARK-3015 Removing broadcast in quick successions causes Akka timeout (Resolved)

links to: [Github] Pull Request #2056 (witgo)

links to: [Github] Pull Request #2143 (tdas)

People

Assignee: Guoqiang Li

Reporter: Josh Rosen

Votes: 0

Watchers: 8

Dates

Created: 19/Aug/14 22:19

Updated: 27/Aug/14 07:18

Resolved: 27/Aug/14 07:18