[SPARK-4454] Race condition in DAGScheduler - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.3.0
Component/s: Scheduler, Spark Core
Labels: None

Description

There seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency:

 Exception in thread "main" java.util.NoSuchElementException: key not found: 35
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
        at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
        at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937)
        at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175)
        at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:192)
        at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
        at org.apache.spark.rdd.PartitionCoalescer$LocationIterator.next(CoalescedRDD.scala:203)
        at org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:257)
        at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:338)
        at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:84)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1150)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:995)
        at me.wwsga.driveclub.EnhancedRDD.saveAsPartitioned(Enhanced.scala:53)
        at Import$$anonfun$22$$anonfun$apply$9$$anonfun$apply$10.apply(Import.scala:186)
        at Import$$anonfun$22$$anonfun$apply$9$$anonfun$apply$10.apply(Import.scala:181)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Code:

  private def getCacheLocs(rdd: RDD[_]): Array[Seq[TaskLocation]] = {
    if (!cacheLocs.contains(rdd.id)) {
      val blockIds = rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
      val locs = BlockManager.blockIdsToBlockManagers(blockIds, env, blockManagerMaster)
      cacheLocs(rdd.id) = blockIds.map { id =>
        locs.getOrElse(id, Nil).map(bm => TaskLocation(bm.host, bm.executorId))
      }
    }
    cacheLocs(rdd.id)
  }

A getOrElseUpdate pattern would probably be better for this code.
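
To illustrate the suggestion, here is a minimal sketch, not the committed fix. It assumes the failure comes from another thread clearing or mutating the map between the `contains` check and the final `cacheLocs(rdd.id)` lookup, which this report does not confirm. `LocationCache` and `getOrCompute` are hypothetical names used only for illustration. Note that `getOrElseUpdate` on a `scala.collection.mutable.HashMap` is not thread-safe by itself, so the sketch also guards every access with a lock on the map.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the scheduler's cache-location map.
// This is an illustration only, not Spark's actual code.
final class LocationCache[V] {
  private val cache = new mutable.HashMap[Int, V]

  // The lookup, the computation, and the insert all happen under one lock,
  // and the value is returned directly instead of being re-read from the map,
  // so a concurrent clear() cannot cause a "key not found" here.
  def getOrCompute(id: Int)(compute: => V): V = cache.synchronized {
    cache.getOrElseUpdate(id, compute)
  }

  // Invalidation takes the same lock as the readers.
  def clear(): Unit = cache.synchronized {
    cache.clear()
  }
}
```

With this shape, callers never perform a separate lookup after the insert, so there is no window in which the entry can disappear between the write and the read. The trade-off is that `compute` runs while the lock is held, which serializes concurrent callers.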

Issue Links

is duplicated by

SPARK-2002 Race condition in accessing cache locations in DAGScheduler (Resolved)

links to

[Github] Pull Request #3345 (mag-)

[Github] Pull Request #4660 (JoshRosen)

Activity

Apache Spark added a comment - 18/Nov/14 11:29

User 'mag-' has created a pull request for this issue:
https://github.com/apache/spark/pull/3345

Sean R. Owen added a comment - 17/Feb/15 19:30

Per PR discussion

Josh Rosen added a comment - 17/Feb/15 19:34

I'm re-opening this issue. srowen, we shouldn't resolve this as "Won't Fix"; that PR was closed because it might not be the proper fix for this bug, but we should still work on a fix. I'm investigating this now and will submit a PR shortly.

Patrick Wendell added a comment - 17/Feb/15 19:36

srowen yeah I meant the particular PR was bad, not that the issue does not exist.

Sean R. Owen added a comment - 17/Feb/15 19:37

My fault, I misunderstood the resolution. Yes, please keep it open.

Apache Spark added a comment - 17/Feb/15 20:45

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4660

Patrick Wendell added a comment - 18/Feb/15 01:42

We can't be 100% sure this is fixed because it was not a reproducible issue. However, Josh has committed a patch that I think should make it hard to have race conditions around the cache location data structure.

Patrick Wendell added a comment - 18/Feb/15 01:44

Actually, re-opening this since we need to back port it.

Rafal Kwasny added a comment - 18/Feb/15 09:40

Thanks for looking into this.
The problem manifested when I had a job with multiple stages (about 100) running in parallel (through Scala Futures) on the same RDD (which had ~1000 partitions).
I will try to verify that this patch fixes the problem.

Sean R. Owen added a comment - 02/Aug/15 08:29

Given the unlikelihood of a further 1.2.x release, I'm closing this as no longer needing a back port.

People

Assignee: Josh Rosen

Reporter: Rafal Kwasny

Votes: 2

Watchers: 11

Dates

Created: 17/Nov/14 18:45

Updated: 17/May/20 17:48

Resolved: 02/Aug/15 08:29