Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 2.0.0
- Labels: None
- Environment: Spark 2.0-SNAPSHOT, standalone mode scheduling, single rack, 8 node cluster, 16 cores & 64G RAM per node, data replication factor of 2. Each node has 1 Spark executor configured with 16 cores and 40GB of RAM.
Description
We got the following log when running LiveJournalPageRank.
452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to acquire write lock for rdd_3_183
452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write lock for rdd_3_183
456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from memory
456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 418784648 dropped from memory (free 3504141600)
457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block rdd_3_183
457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block rdd_3_183
457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to remove block rdd_3_183
500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put rdd_3_183
500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire read lock for rdd_3_183
500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire write lock for rdd_3_183
500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write lock for rdd_3_183
517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ****** taskAttemptId is: 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError happeneds here*****
517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 10.0 (TID 1662)
517259-java.lang.AssertionError: assertion failed
517260- at scala.Predef$.assert(Predef.scala:151)
517261- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
517262- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
517263- at scala.Option.foreach(Option.scala:257)
517264- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
517265- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
517267- at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
517268- at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
517269- at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
When memory for RDD storage is insufficient and several partitions have to be evicted, this AssertionError may happen.
In the example above, while running Task 1662, several partitions (including rdd_3_183) need to be evicted. Task 1662 therefore acquires the read and write locks first, then calls the dropBlock method in MemoryStore.evictBlocksToFreeSpace and actually drops rdd_3_183 from memory. Because newEffectiveStorageLevel.isValid is false, execution reaches BlockInfoManager.removeBlock, but writeLocksByTask is not updated there.
Unfortunately, Task 1681 has already started and needs to recompute rdd_3_183 to produce its target RDD, so it acquires the write lock on rdd_3_183. When Task 1662 finally calls releaseAllLocksForTask, its stale writeLocksByTask entry still names rdd_3_183, but the block's writerTask is now 1681 (as the instrumented log line above shows), so the assertion fails.
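To make the race concrete, here is a minimal, self-contained Scala sketch of the bookkeeping problem. This is not Spark's actual BlockInfoManager code; the map names and method signatures are simplified stand-ins, but the missing cleanup in removeBlock and the failing assertion in releaseAllLocksForTask mirror the behavior described above.

import scala.collection.mutable

object WriteLockLeakSketch {
  // One metadata record per block; writerTask is the task holding the write lock.
  final class BlockInfo(var writerTask: Long)

  private val infos = mutable.HashMap.empty[String, BlockInfo]
  // Per-task bookkeeping of held write locks, analogous to writeLocksByTask.
  private val writeLocksByTask = mutable.HashMap.empty[Long, mutable.HashSet[String]]

  def acquireWriteLock(task: Long, block: String): Unit = {
    val info = infos.getOrElseUpdate(block, new BlockInfo(task))
    info.writerTask = task
    writeLocksByTask.getOrElseUpdate(task, mutable.HashSet.empty) += block
  }

  // Buggy removal: drops the block's metadata but leaves the stale entry
  // in the evicting task's write-lock set -- the missing update.
  def removeBlock(task: Long, block: String): Unit = {
    infos.remove(block)
    // missing: writeLocksByTask.get(task).foreach(_ -= block)
  }

  def releaseAllLocksForTask(task: Long): Unit =
    for {
      blocks <- writeLocksByTask.remove(task)
      block  <- blocks
      info   <- infos.get(block) // block may have been recreated by another task
    } {
      // Plays the role of the assert at BlockInfoManager.scala:356 in the trace.
      assert(info.writerTask == task)
      info.writerTask = -1
    }

  def main(args: Array[String]): Unit = {
    acquireWriteLock(1662L, "rdd_3_183") // Task 1662 locks the block to evict it
    removeBlock(1662L, "rdd_3_183")      // eviction; stale lock entry survives
    acquireWriteLock(1681L, "rdd_3_183") // Task 1681 recreates the block
    releaseAllLocksForTask(1662L)        // java.lang.AssertionError, as in the log
  }
}

A natural fix along these lines is for the remove path to also clear the removing task's entry from writeLocksByTask, so that releaseAllLocksForTask never sees locks for blocks the task no longer owns.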