For multi-master usage to be truly safe, we must ensure that a failure to write to the system catalog table is handled correctly. With only one master this can only happen in the event of a disk failure or equivalent, but with multiple masters such failures can happen at any time (e.g. failed replicas, network partitions).
So far I've only found one case where this is truly broken, in catalog_manager.cc:L2444:
2433 void CatalogManager::DeleteTabletsAndSendRequests(const scoped_refptr<TableInfo>& table) {
2434   vector<scoped_refptr<TabletInfo> > tablets;
2435   table->GetAllTablets(&tablets);
2436
2437   string deletion_msg = "Table deleted at " + LocalTimeAsString();
2438
2439   for (const scoped_refptr<TabletInfo>& tablet : tablets) {
2440     DeleteTabletReplicas(tablet.get(), deletion_msg);
2441
2442     TabletMetadataLock tablet_lock(tablet.get(), TabletMetadataLock::WRITE);
2443     tablet_lock.mutable_data()->set_state(SysTabletsEntryPB::DELETED, deletion_msg);
>2444     CHECK_OK(sys_catalog_->UpdateTablets({ tablet.get() }));
2445     tablet_lock.Commit();
2446   }
2447 }
Here the CHECK_OK crashes the master if the system catalog write fails. Instead, we should batch up all of the tablet deletions into one UpdateTablets() call, and pass the resulting status up to the DeleteTable caller.
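A rough sketch of the shape this could take, to make the proposal concrete. This is not the actual Kudu code: Status, Tablet, FakeSysCatalog and DeleteTabletsBatched are simplified, hypothetical stand-ins, the replica deletion requests are left out, and the committed/pending fields only imitate what the metadata locks do. The point is just one write for the whole batch, with a failure returned to the caller rather than crashing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal status type (not kudu::Status).
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status IOError(std::string msg) { return Status(false, std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

enum class TabletState { RUNNING, DELETED };

// Hypothetical tablet record: 'committed' is what readers see,
// 'pending' is the not-yet-committed mutation.
struct Tablet {
  TabletState committed = TabletState::RUNNING;
  TabletState pending = TabletState::RUNNING;
};

// Hypothetical stand-in for the system catalog. Setting fail_writes
// simulates a failed replicated write (e.g. a network partition).
struct FakeSysCatalog {
  bool fail_writes = false;
  int num_writes = 0;
  std::size_t last_batch_size = 0;

  Status UpdateTablets(const std::vector<Tablet*>& tablets) {
    ++num_writes;
    last_batch_size = tablets.size();
    if (fail_writes) return Status::IOError("sys catalog write failed");
    return Status::OK();
  }
};

// Marks every tablet deleted using a single catalog write. On failure,
// nothing is committed and the error is returned to the caller.
Status DeleteTabletsBatched(FakeSysCatalog* catalog, std::vector<Tablet>* tablets) {
  assert(catalog != nullptr && tablets != nullptr);

  // Stage all mutations first, without committing any of them.
  std::vector<Tablet*> batch;
  for (Tablet& t : *tablets) {
    t.pending = TabletState::DELETED;
    batch.push_back(&t);
  }

  // One write for the whole table instead of one per tablet.
  Status s = catalog->UpdateTablets(batch);
  if (!s.ok()) {
    // Discard the staged mutations; in-memory state stays unchanged.
    for (Tablet* t : batch) t->pending = t->committed;
    return s;
  }

  // The write succeeded, so it is now safe to commit in memory.
  for (Tablet* t : batch) t->committed = t->pending;
  return Status::OK();
}
```

With a shape like this, the DeleteTable path would check the returned status and report the failure to the client instead of aborting the process. How the replica deletion requests should be ordered relative to the catalog write is a separate question that this sketch does not address.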
Part of the work here is an integration test that provides good coverage for the various failure paths.
Duplicates: KUDU-495 Master should handle the case where we REPLICATE fails (Resolved)