Description
While doing a proof of concept (POC) with the rebalancer, I got a core dump when running intra-location rebalancing.
Here is the log:
I2019-10-30 20:02:17.843044 40915 rebalancer_tool.cc:225] running rebalancer within location '/location/2044'
F2019-10-30 20:02:17.884591 40915 map-util.h:109] Check failed: it != collection.end() Map key not found: a9119004b2d24f42a1acf09d142565fb
*** Check failure stack trace: ***
@ 0x111a75d google::LogMessage::Fail()
@ 0x111c6d3 google::LogMessage::SendToLog()
@ 0x111a2b9 google::LogMessage::Flush()
@ 0x111d0ef google::LogMessageFatal::~LogMessageFatal()
@ 0xe26da7 FindOrDie<>()
@ 0xe1f204 kudu::tools::RebalancerTool::AlgoBasedRunner::GetNextMovesImpl()
@ 0xe162e0 kudu::tools::RebalancerTool::BaseRunner::GetNextMoves()
@ 0xe15bf5 kudu::tools::RebalancerTool::RunWith()
@ 0xe1db0e kudu::tools::RebalancerTool::Run()
@ 0xb6fea1 kudu::tools::(anonymous namespace)::RunRebalance()
@ 0xb70e14 std::_Function_handler<>::_M_invoke()
@ 0x11714a2 kudu::tools::Action::Run()
@ 0xc00587 kudu::tools::DispatchCommand()
@ 0xc00f4b kudu::tools::RunTool()
@ 0xb0fd6d main
@ 0x7f37086a4b15 __libc_start_main
@ 0xb6b399 (unknown)
I think the problem may be in RebalancerTool::AlgoBasedRunner::GetNextMovesImpl: when building extra_info_by_tablet_id, it checks that the table ID of every tablet is present in the table info. But when we build ClusterRawInfo in RebalancerTool::KsckResultsToClusterRawInfo, we collect only the tables that occur in the location, while still collecting all tablets in the cluster.
This problem occurs when the location does not hold a replica of every table. It is likely to happen when the number of locations is much greater than a table's replication factor.
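To illustrate the suspected mismatch, here is a minimal, self-contained sketch. It is not Kudu's actual code: TabletSummary, RawInfo, BuildRawInfo and TabletsWithMissingTable are hypothetical, simplified stand-ins. It only assumes what is described above: tables are filtered by location while tablets are not, so a later lookup of a tablet's table ID can miss.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; not Kudu's real types.
struct TabletSummary {
  std::string id;
  std::string table_id;
};

struct RawInfo {
  std::map<std::string, std::string> table_name_by_id;  // tables kept after filtering
  std::vector<TabletSummary> tablets;                   // tablets kept (unfiltered)
};

// Mimics the suspected behavior: only tables present in the location are
// kept, but every tablet in the cluster is kept.
RawInfo BuildRawInfo(const std::map<std::string, std::string>& all_tables,
                     const std::vector<TabletSummary>& all_tablets,
                     const std::set<std::string>& table_ids_in_location) {
  RawInfo info;
  for (const auto& kv : all_tables) {
    if (table_ids_in_location.count(kv.first) > 0) {
      info.table_name_by_id.insert(kv);
    }
  }
  info.tablets = all_tablets;  // no location filter applied here
  return info;
}

// Returns the IDs of tablets whose table is absent from the table map.
// A lookup that aborts on a missing key would fail on each of these.
std::vector<std::string> TabletsWithMissingTable(const RawInfo& info) {
  std::vector<std::string> missing;
  for (const auto& tablet : info.tablets) {
    if (info.table_name_by_id.find(tablet.table_id) ==
        info.table_name_by_id.end()) {
      missing.push_back(tablet.id);
    }
  }
  return missing;
}
```

With two tables in the cluster and only one of them having a replica in the location, TabletsWithMissingTable reports the tablets of the other table. One possible direction for a fix, stated here only as an assumption, is to filter tablets by the same location criterion as tables, or to skip tablets whose table is not in the map.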