Description
One of our production clusters ran into an incident a few days ago. A user has a table whose partition schema looks like:
HASH (uuid) PARTITIONS 80,
RANGE (date_hour) (
    PARTITION 2019102900 <= VALUES < 2019102901,
    PARTITION 2019102901 <= VALUES < 2019102902,
    PARTITION 2019102902 <= VALUES < 2019102903,
    PARTITION 2019102903 <= VALUES < 2019102904,
    PARTITION 2019102904 <= VALUES < 2019102905,
    ...
)
He tried to drop many outdated partitions at once via SparkSQL, but the statement returned a timeout error at first; then he retried again and again, and SparkSQL failed again and again. The cluster then became unstable, with memory usage and CPU load increasing.
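For scale, here is a back-of-the-envelope sketch of how such a drop fans out into tablet deletions: with 80 hash buckets, every dropped range partition means 80 tablets to delete, and the master sends a DeleteTablet RPC per replica. The number of dropped partitions and the replication factor below are assumptions, not figures from the incident.

// Back-of-the-envelope sketch; constants are assumptions for illustration only.
#include <cstdio>

int main() {
  const int hash_buckets = 80;              // HASH (uuid) PARTITIONS 80
  const int dropped_range_partitions = 24;  // hypothetical: one day of hourly partitions
  const int replication_factor = 3;         // assumed default

  int tablets_to_delete = hash_buckets * dropped_range_partitions;
  int delete_tablet_rpcs = tablets_to_delete * replication_factor;
  std::printf("tablets to delete: %d\n", tablets_to_delete);   // 1920
  std::printf("DeleteTablet RPCs: %d\n", delete_tablet_rpcs);  // 5760
  return 0;
}

So a single ALTER that drops a day's worth of partitions already produces thousands of DeleteTablet RPCs, before any client or master retries are counted.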
I found many log entries like this in the tserver logs:
W1030 17:29:53.382287 7588 rpcz_store.cc:259] Trace:
1030 17:26:19.714799 (+ 0us) service_pool.cc:162] Inserting onto call queue
1030 17:26:19.714808 (+ 9us) service_pool.cc:221] Handling call
1030 17:29:53.382204 (+213667396us) ts_tablet_manager.cc:874] Deleting tablet c52c5f43f7884d08b07fd0005e878fed
1030 17:29:53.382205 (+ 1us) ts_tablet_manager.cc:794] Acquired tablet manager lock
1030 17:29:53.382208 (+ 3us) inbound_call.cc:162] Queueing success response
Metrics: {"tablet-delete.queue_time_us":213667360}
W1030 17:29:53.382300 7586 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820316) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
W1030 17:29:53.382292 10623 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820315) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
W1030 17:29:53.382297 10622 rpcz_store.cc:259] Trace:
1030 17:26:19.714825 (+ 0us) service_pool.cc:162] Inserting onto call queue
1030 17:26:19.714833 (+ 8us) service_pool.cc:221] Handling call
1030 17:29:53.382239 (+213667406us) ts_tablet_manager.cc:874] Deleting tablet 479f8c592f16408c830637a0129359e1
1030 17:29:53.382241 (+ 2us) ts_tablet_manager.cc:794] Acquired tablet manager lock
1030 17:29:53.382244 (+ 3us) inbound_call.cc:162] Queueing success response
Metrics: {"tablet-delete.queue_time_us":213667378}
...
That means acquiring the tablet manager lock took a long time, right? The lock is taken in exclusive mode for every replica state transition:
Status TSTabletManager::BeginReplicaStateTransition(
    const string& tablet_id,
    const string& reason,
    scoped_refptr<TabletReplica>* replica,
    scoped_refptr<TransitionInProgressDeleter>* deleter,
    TabletServerErrorPB::Code* error_code) {
  // Acquire the lock in exclusive mode as we'll add a entry to the
  // transition_in_progress_ map.
  std::lock_guard<RWMutex> lock(lock_);
  TRACE("Acquired tablet manager lock");
  RETURN_NOT_OK(CheckRunningUnlocked(error_code));
  ...
}
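To illustrate that hypothesis, here is a minimal standalone sketch (not Kudu code) of many callers serializing on one exclusive lock, the way a pile-up of DeleteTablet calls would serialize on TSTabletManager::lock_. The caller count and the time spent under the lock are made-up numbers; the point is that the last caller waits roughly the sum of all earlier critical sections.

// Standalone illustration only; std::shared_mutex stands in for Kudu's RWMutex.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

int main() {
  std::shared_mutex lock;                  // stands in for TSTabletManager::lock_
  const int kConcurrentDeletes = 50;       // hypothetical number of queued DeleteTablet calls
  const auto kCriticalSection = std::chrono::milliseconds(20);  // hypothetical work under the lock

  std::vector<std::thread> threads;
  for (int i = 0; i < kConcurrentDeletes; ++i) {
    threads.emplace_back([&, i] {
      auto start = std::chrono::steady_clock::now();
      std::unique_lock<std::shared_mutex> guard(lock);  // exclusive, like lock_guard<RWMutex>
      auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
          std::chrono::steady_clock::now() - start);
      std::printf("caller %2d acquired lock after %lld ms\n", i,
                  static_cast<long long>(waited.count()));
      std::this_thread::sleep_for(kCriticalSection);    // time spent while holding the lock
    });
  }
  for (auto& t : threads) t.join();
  return 0;
}

Later callers report waits approaching kConcurrentDeletes * kCriticalSection, which is the same shape as the multi-minute waits in the trace above.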
But I think the root cause is that the Kudu master sent too many duplicate 'alter table/delete tablet' requests to the tservers. I found more info in the master's log:
$ grep "Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d" kudu_master.zjy-hadoop-prc-ct01.bj.work.log.INFO.20191030-204137.62788 | egrep "attempt = 1\)" I1030 20:41:42.207222 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 43 ms (attempt = 1) I1030 20:41:42.207556 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 40 ms (attempt = 1) I1030 20:41:42.260052 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 31 ms (attempt = 1) I1030 20:41:42.278609 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 19 ms (attempt = 1) I1030 20:41:42.312175 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 48 ms (attempt = 1) I1030 20:41:42.318933 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 62 ms (attempt = 1) I1030 20:41:42.340060 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 30 ms (attempt = 1) ...
That means the master received too many duplicate 'delete tablet' requests from the client and then dispatched those requests to the tservers.
I think we should limit the number of concurrent alter requests per table: while an alter request is still in progress, subsequent alter requests for the same table should be rejected. Alternatively, we could limit the size of TableInfo::pending_tasks_.
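To make the second idea concrete, here is a minimal standalone sketch (not actual CatalogManager code) of a per-table limiter that rejects duplicate or excess tasks instead of scheduling them. The class name, the task-id scheme, and the limit value are all hypothetical.

// Conceptual sketch of bounding per-table pending tasks; not Kudu's real API.
#include <cstdio>
#include <mutex>
#include <string>
#include <unordered_set>

class PendingTaskLimiter {
 public:
  explicit PendingTaskLimiter(size_t max_pending) : max_pending_(max_pending) {}

  // Returns false if the task is already pending or the per-table limit is reached;
  // the caller would then reply with an error instead of scheduling another RPC.
  bool TryAdd(const std::string& task_id) {
    std::lock_guard<std::mutex> guard(lock_);
    if (pending_.size() >= max_pending_) return false;
    return pending_.insert(task_id).second;  // false if duplicate
  }

  // Called when the task completes or is aborted.
  void Remove(const std::string& task_id) {
    std::lock_guard<std::mutex> guard(lock_);
    pending_.erase(task_id);
  }

 private:
  const size_t max_pending_;
  std::mutex lock_;
  std::unordered_set<std::string> pending_;  // e.g. "DeleteTablet:<tablet_id>"
};

int main() {
  PendingTaskLimiter limiter(/*max_pending=*/2);
  std::printf("%d\n", limiter.TryAdd("DeleteTablet:c52c5f43"));  // 1: scheduled
  std::printf("%d\n", limiter.TryAdd("DeleteTablet:c52c5f43"));  // 0: duplicate rejected
  std::printf("%d\n", limiter.TryAdd("DeleteTablet:479f8c59"));  // 1: scheduled
  std::printf("%d\n", limiter.TryAdd("DeleteTablet:1234abcd"));  // 0: limit reached
  return 0;
}

With something like this in front of task scheduling, repeated client retries of the same drop-partition ALTER would be deduplicated at the master instead of multiplying DeleteTablet RPCs toward the tservers.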