Kudu / KUDU-1328

TS crashes in RemoteBootstrapSession::Init()

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: recovery, tserver
    • Labels:
      None

      Description

      Three nodes on the YCSB cluster crashed within a minute of one another. The backtrace:

      #0  kudu::tserver::RemoteBootstrapSession::Init (this=0x4e633dc0)
          at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/tserver/remote_bootstrap_session.cc:94
      #1  0x00000000007871e8 in kudu::tserver::RemoteBootstrapServiceImpl::BeginRemoteBootstrapSession (this=0x33e4a20, 
          req=Unhandled dwarf expression opcode 0xf3) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/tserver/remote_bootstrap_service.cc:130
      #2  0x00000000007f777a in kudu::tserver::RemoteBootstrapServiceIf::Handle (this=0x33e4a20, call=0xbc07e6c0)
          at /usr/src/debug/kudu-0.7.0-kudu0.7.0/build/release/src/kudu/tserver/remote_bootstrap.service.cc:59
      #3  0x00000000009d87b8 in kudu::rpc::ServicePool::RunThread (this=0x33d8dc0)
          at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/rpc/service_pool.cc:174
      #4  0x00000000017a1d1a in operator() (arg=0x3576f70)
          at /opt/toolchain/boost-pic-1.55.0/include/boost/function/function_template.hpp:767
      #5  kudu::Thread::SuperviseThread (arg=0x3576f70) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/util/thread.cc:580
      #6  0x00000030234079d1 in start_thread () from /lib64/libpthread.so.0
      #7  0x00000030230e88fd in clone () from /lib64/libc.so.6
      

      The offending code:

        LOG(INFO) << "T " << tablet_peer_->tablet_id()
                  << " P " << tablet_peer_->consensus()->peer_uuid()
                  << ": Remote bootstrap: Opening " << data_blocks.size() << " blocks";
      

      Specifically, tablet_peer_->consensus() returns 0x0, so the chained peer_uuid() call inside the LOG() statement dereferences a null pointer. From the logging it looks like we're trying to remote bootstrap a tablet that has just been shut down, but on a macro level I don't know how that would happen. This is a regression from commit b841512, which introduced this LOG() statement. Fixing it is easy enough, but I'm going to try to add an integration test that teases out the crash.
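
      For illustration, here is a minimal sketch of one way to guard a log prefix like this against a null consensus pointer. FakeConsensus, FakeTabletPeer, and LogPrefix are hypothetical stand-ins invented for this example; they are not Kudu's real classes, and the fix actually committed for this issue may look different.

```cpp
#include <memory>
#include <sstream>
#include <string>

// Hypothetical stand-ins for illustration only; NOT Kudu's real classes.
struct FakeConsensus {
  std::string uuid;
  const std::string& peer_uuid() const { return uuid; }
};

struct FakeTabletPeer {
  std::string id;
  std::shared_ptr<FakeConsensus> consensus_;  // assumed null once the tablet is shut down
  const std::string& tablet_id() const { return id; }
  std::shared_ptr<FakeConsensus> consensus() const { return consensus_; }
};

// Builds the "T <tablet> P <peer>: " prefix without dereferencing a null
// consensus pointer. The pointer is read once into a local so the null
// check and the later use refer to the same object.
std::string LogPrefix(const FakeTabletPeer& peer) {
  std::shared_ptr<FakeConsensus> consensus = peer.consensus();
  std::ostringstream out;
  out << "T " << peer.tablet_id() << " P "
      << (consensus ? consensus->peer_uuid() : std::string("(unknown)"))
      << ": ";
  return out.str();
}
```

      In this sketch, a peer whose consensus() is null gets the placeholder "(unknown)" in its prefix instead of crashing; the placeholder text is an arbitrary choice.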

      I've filed this as a 0.7.0 blocker because I didn't know any better; feel free to kick it to 0.8.0 if you disagree.

            People

            • Assignee: Adar Dembo
            • Reporter: Adar Dembo
            • Votes: 0
            • Watchers: 3
