KUDU-2987

Intra-location rebalance crashes in a special case

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.0, 1.10.0, 1.10.1, 1.11.0
    • Fix Version/s: 1.12.0, 1.11.1
    • Component/s: CLI
    • Labels: None

    Description

      While running a proof of concept of the rebalancer, I got a core dump during intra-location rebalancing.

      Here is the log:

      I2019-10-30 20:02:17.843044 40915 rebalancer_tool.cc:225] running rebalancer within location '/location/2044'
      F2019-10-30 20:02:17.884591 40915 map-util.h:109] Check failed: it != collection.end() Map key not found: a9119004b2d24f42a1acf09d142565fb
      *** Check failure stack trace: ***
          @          0x111a75d  google::LogMessage::Fail()
          @          0x111c6d3  google::LogMessage::SendToLog()
          @          0x111a2b9  google::LogMessage::Flush()
          @          0x111d0ef  google::LogMessageFatal::~LogMessageFatal()
          @           0xe26da7  FindOrDie<>()
          @           0xe1f204  kudu::tools::RebalancerTool::AlgoBasedRunner::GetNextMovesImpl()
          @           0xe162e0  kudu::tools::RebalancerTool::BaseRunner::GetNextMoves()
          @           0xe15bf5  kudu::tools::RebalancerTool::RunWith()
          @           0xe1db0e  kudu::tools::RebalancerTool::Run()
          @           0xb6fea1  kudu::tools::(anonymous namespace)::RunRebalance()
          @           0xb70e14  std::_Function_handler<>::_M_invoke()
          @          0x11714a2  kudu::tools::Action::Run()
          @           0xc00587  kudu::tools::DispatchCommand()
          @           0xc00f4b  kudu::tools::RunTool()
          @           0xb0fd6d  main
          @     0x7f37086a4b15  __libc_start_main
          @           0xb6b399  (unknown)
      
      

      The problem appears to be in RebalancerTool::AlgoBasedRunner::GetNextMovesImpl. When it builds extra_info_by_tablet_id, it requires the table ID of every tablet to be present in the table info. However, when RebalancerTool::KsckResultsToClusterRawInfo builds the ClusterRawInfo for a location, it collects only the tables that have replicas in that location, but all tablets in the cluster.

      The crash therefore occurs whenever a location does not hold replicas of every table. This is likely when the number of locations is much larger than a table's replication factor.
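
      The mismatch described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not Kudu's actual code: the structs TableSummary, TabletSummary and RawInfo and the functions TabletsWithMissingTable, FilterTabletsToKnownTables and MakeExampleInfo are hypothetical, simplified stand-ins. The filtering step shows one possible way to make the two lists consistent and is not necessarily how the issue was fixed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the per-location raw info.
// The real Kudu structures carry many more fields.
struct TableSummary {
  std::string id;
  int replication_factor;
};

struct TabletSummary {
  std::string id;
  std::string table_id;
};

struct RawInfo {
  std::vector<TableSummary> tables;    // only tables with replicas in the location
  std::vector<TabletSummary> tablets;  // every tablet in the cluster
};

// Returns the IDs of tablets whose table is absent from info.tables.
// An unconditional, FindOrDie-style lookup would abort on exactly these.
std::vector<std::string> TabletsWithMissingTable(const RawInfo& info) {
  std::map<std::string, int> rf_by_table_id;
  for (const auto& table : info.tables) {
    rf_by_table_id[table.id] = table.replication_factor;
  }
  std::vector<std::string> missing;
  for (const auto& tablet : info.tablets) {
    if (rf_by_table_id.find(tablet.table_id) == rf_by_table_id.end()) {
      missing.push_back(tablet.id);
    }
  }
  return missing;
}

// One possible mitigation (an assumption, not the confirmed fix): drop the
// tablets whose table is unknown in this location before building per-tablet
// info, so that every remaining lookup succeeds.
RawInfo FilterTabletsToKnownTables(const RawInfo& info) {
  std::map<std::string, int> rf_by_table_id;
  for (const auto& table : info.tables) {
    rf_by_table_id[table.id] = table.replication_factor;
  }
  RawInfo out;
  out.tables = info.tables;
  for (const auto& tablet : info.tablets) {
    if (rf_by_table_id.count(tablet.table_id) > 0) {
      out.tablets.push_back(tablet);
    }
  }
  // Postcondition: no tablet refers to a table missing from the table list.
  assert(TabletsWithMissingTable(out).empty());
  return out;
}

// Example input: the location hosts replicas of table "A" only, while the
// tablet list still contains a tablet of table "B".
RawInfo MakeExampleInfo() {
  RawInfo info;
  info.tables = {{"A", 3}};
  info.tablets = {{"tablet-1", "A"}, {"tablet-2", "B"}};
  return info;
}
```

      With the example input, TabletsWithMissingTable reports the tablet of table "B", which corresponds to the key that the failed check in the log could not find. After FilterTabletsToKnownTables, only the tablet of table "A" remains.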

       

       

          People

            Assignee: ZhangYao
            Reporter: ZhangYao