Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 2.0.0
- Labels: None
- Environment: Spark 2.0-SNAPSHOT, standalone mode scheduling, single rack, 8 node cluster, 16 cores & 64G RAM per node, data replication factor of 2. Each node has 1 Spark executor configured with 16 cores and 40GB of RAM.
Description
We got the following log when running LiveJournalPageRank.
452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to acquire write lock for rdd_3_183
452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write lock for rdd_3_183
456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from memory
456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 418784648 dropped from memory (free 3504141600)
457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block rdd_3_183
457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block rdd_3_183
457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to remove block rdd_3_183
500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put rdd_3_183
500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire read lock for rdd_3_183
500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire write lock for rdd_3_183
500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write lock for rdd_3_183
517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ****** taskAttemptId is: 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError happeneds here*****
517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 10.0 (TID 1662)
517259-java.lang.AssertionError: assertion failed
517260- at scala.Predef$.assert(Predef.scala:151)
517261- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
517262- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
517263- at scala.Option.foreach(Option.scala:257)
517264- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
517265- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
517267- at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
517268- at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
517269- at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
When memory for RDD storage is insufficient and several partitions have to be evicted, this AssertionError may happen.
In the example above, while running Task 1662, several partitions (including rdd_3_183) need to be evicted. Task 1662 therefore acquires the read and write locks first, then calls the dropBlock method in MemoryStore.evictBlocksToFreeSpace and actually drops rdd_3_183 from memory. Because newEffectiveStorageLevel.isValid is false, execution reaches BlockInfoManager.removeBlock, but writeLocksByTask is not updated there.
Unfortunately, Task 1681 has already started and needs to recompute rdd_3_183 to produce its target RDD, so it acquires the write lock on rdd_3_183. When Task 1662 finally calls releaseAllLocksForTask, its stale writeLocksByTask entry still names rdd_3_183, but the block's writerTask is now 1681 (as the instrumented log line above shows), so the assertion fails.
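To make the race concrete, here is a minimal, self-contained Scala sketch of the bookkeeping problem. This is not Spark's actual BlockInfoManager code; the map names and method signatures are simplified stand-ins, but the missing cleanup in removeBlock and the failing assertion in releaseAllLocksForTask mirror the behavior described above.

import scala.collection.mutable

object WriteLockLeakSketch {
  // One metadata record per block; writerTask is the task holding the write lock.
  final class BlockInfo(var writerTask: Long)

  private val infos = mutable.HashMap.empty[String, BlockInfo]
  // Per-task bookkeeping of held write locks, analogous to writeLocksByTask.
  private val writeLocksByTask = mutable.HashMap.empty[Long, mutable.HashSet[String]]

  def acquireWriteLock(task: Long, block: String): Unit = {
    val info = infos.getOrElseUpdate(block, new BlockInfo(task))
    info.writerTask = task
    writeLocksByTask.getOrElseUpdate(task, mutable.HashSet.empty) += block
  }

  // Buggy removal: drops the block's metadata but leaves the stale entry
  // in the evicting task's write-lock set -- the missing update.
  def removeBlock(task: Long, block: String): Unit = {
    infos.remove(block)
    // missing: writeLocksByTask.get(task).foreach(_ -= block)
  }

  def releaseAllLocksForTask(task: Long): Unit =
    for {
      blocks <- writeLocksByTask.remove(task)
      block  <- blocks
      info   <- infos.get(block) // block may have been recreated by another task
    } {
      // Plays the role of the assert at BlockInfoManager.scala:356 in the trace.
      assert(info.writerTask == task)
      info.writerTask = -1
    }

  def main(args: Array[String]): Unit = {
    acquireWriteLock(1662L, "rdd_3_183") // Task 1662 locks the block to evict it
    removeBlock(1662L, "rdd_3_183")      // eviction; stale lock entry survives
    acquireWriteLock(1681L, "rdd_3_183") // Task 1681 recreates the block
    releaseAllLocksForTask(1662L)        // java.lang.AssertionError, as in the log
  }
}

A natural fix along these lines is for the remove path to also clear the removing task's entry from writeLocksByTask, so that releaseAllLocksForTask never sees locks for blocks the task no longer owns.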