Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
3.5.0
-
None
Description
We are seeing query failure which is caused by RocksDB acquisition failure for the retry tasks.
- at t1, we shrink the cluster to only have one executor
23/03/05 22:47:21 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230305224215-0000/2 is now DECOMMISSIONED (worker decommissioned because of kill request from HTTP endpoint (data migration disabled)) 23/03/05 22:47:21 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230305224215-0000/3 is now DECOMMISSIONED (worker decommissioned because of kill request from HTTP endpoint (data migration disabled))
- at t1+2min, task 7 at its first attempt (i.e. task 7.0) is scheduled to the alive executor
23/03/05 22:49:58 INFO TaskSetManager: Starting task 7.0 in stage 133.0 (TID 685) (10.166.225.249, executor 0, partition 7, ANY,
It seems that task 7.0 is able to pass dataRDD.iterator(partition, ctxt) and acquires the rocksdb lock as we are seeing
23/03/05 22:51:59 WARN TaskSetManager: Lost task 4.1 in stage 133.1 (TID 700) (10.166.225.249 executor 0): java.lang.IllegalStateException: StateStoreId(opId=0,partId=7,name=default): RocksDB instance could not be acquired by [ThreadId: Some(50), task: partition 7.1 in stage 133.1, TID 700] as it was not released by [ThreadId: Some(449), task: partition 7.0 in stage 133.0, TID 685] after 60003 ms. 23/03/05 22:52:59 WARN TaskSetManager: Lost task 4.2 in stage 133.1 (TID 702) (10.166.225.249 executor 0): java.lang.IllegalStateException: StateStoreId(opId=0,partId=7,name=default): RocksDB instance could not be acquired by [ThreadId: Some(1495), task: partition 7.2 in stage 133.1, TID 702] as it was not released by [ThreadId: Some(449), task: partition 7.0 in stage 133.0, TID 685] after 60006 ms. 23/03/05 22:53:59 WARN TaskSetManager: Lost task 4.3 in stage 133.1 (TID 704) (10.166.225.249 executor 0): java.lang.IllegalStateException: StateStoreId(opId=0,partId=7,name=default): RocksDB instance could not be acquired by [ThreadId: Some(46), task: partition 7.3 in stage 133.1, TID 704] as it was not released by [ThreadId: Some(449), task: partition 7.0 in stage 133.0, TID 685] after 60003 ms.
Increasing the lockAcquireTimeoutMs to 2 minutes such that 4 task retries will give us 8 minutes to acquire the lock and it is larger than connectionTimeout with retries (3 * 120s).