[KUDU-1328] TS crashes in RemoteBootstrapSession::Init() - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.7.0
Fix Version/s: 0.7.0
Component/s: recovery, tserver
Labels: None
Target Version/s: 0.7.0
Code Review: http://gerrit.cloudera.org:8080/#/c/2193/

Description

Three nodes on the YCSB cluster crashed within the same minute. The backtrace:

#0  kudu::tserver::RemoteBootstrapSession::Init (this=0x4e633dc0)
    at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/tserver/remote_bootstrap_session.cc:94
#1  0x00000000007871e8 in kudu::tserver::RemoteBootstrapServiceImpl::BeginRemoteBootstrapSession (this=0x33e4a20, 
    req=Unhandled dwarf expression opcode 0xf3) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/tserver/remote_bootstrap_service.cc:130
#2  0x00000000007f777a in kudu::tserver::RemoteBootstrapServiceIf::Handle (this=0x33e4a20, call=0xbc07e6c0)
    at /usr/src/debug/kudu-0.7.0-kudu0.7.0/build/release/src/kudu/tserver/remote_bootstrap.service.cc:59
#3  0x00000000009d87b8 in kudu::rpc::ServicePool::RunThread (this=0x33d8dc0)
    at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/rpc/service_pool.cc:174
#4  0x00000000017a1d1a in operator() (arg=0x3576f70)
    at /opt/toolchain/boost-pic-1.55.0/include/boost/function/function_template.hpp:767
#5  kudu::Thread::SuperviseThread (arg=0x3576f70) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/util/thread.cc:580
#6  0x00000030234079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030230e88fd in clone () from /lib64/libc.so.6

The offending code:

  LOG(INFO) << "T " << tablet_peer_->tablet_id()
            << " P " << tablet_peer_->consensus()->peer_uuid()
            << ": Remote bootstrap: Opening " << data_blocks.size() << " blocks";

Specifically, consensus() returns 0x0, so the chained peer_uuid() call inside the LOG() statement dereferences a null pointer. From the logging it looks like we're trying to remote bootstrap a tablet that has just been shut down, but at a macro level I don't know how that would happen. This is a regression from commit b841512, which introduced this LOG() statement. Fixing it is easy enough, but I'm going to try to add an integration test that teases out the crash.
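For illustration, one way to guard against this kind of crash is to take the consensus pointer once, check it for null, and only then build the log message. The following is a minimal sketch, not the actual change under review: Consensus, TabletPeer, shared_consensus() and BuildOpeningBlocksMessage() are hypothetical stand-ins written for this example, and the real Kudu classes may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical stand-in for the consensus object; only peer_uuid() matters here.
struct Consensus {
  std::string uuid;
  const std::string& peer_uuid() const { return uuid; }
};

// Hypothetical stand-in for the tablet peer. The consensus pointer is
// null when the tablet has been shut down (assumption for this sketch).
struct TabletPeer {
  std::string id;
  std::shared_ptr<Consensus> consensus_ptr;

  const std::string& tablet_id() const { return id; }

  // Returns a copy of the shared pointer so that the caller keeps the
  // object alive for as long as it uses it.
  std::shared_ptr<Consensus> shared_consensus() const { return consensus_ptr; }
};

// Builds the "T <tablet> P <peer>: ..." message into *msg.
// Returns false and leaves *msg untouched if the peer has no consensus
// instance, instead of dereferencing a null pointer.
bool BuildOpeningBlocksMessage(const TabletPeer& peer,
                               std::size_t num_blocks,
                               std::string* msg) {
  assert(msg != nullptr);
  std::shared_ptr<Consensus> consensus = peer.shared_consensus();
  if (!consensus) {
    return false;  // Tablet is shut down (or not yet started); nothing to log.
  }
  std::ostringstream out;
  out << "T " << peer.tablet_id()
      << " P " << consensus->peer_uuid()
      << ": Remote bootstrap: Opening " << num_blocks << " blocks";
  *msg = out.str();
  return true;
}
```

In this sketch a caller would fail the session initialization with an error when the function returns false, rather than continuing with a tablet that is no longer running. Holding the pointer in a local variable also avoids calling consensus() twice and racing with a concurrent shutdown between the check and the use.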

I've filed this as a 0.7.0 blocker because I didn't know any better; feel free to kick it to 0.8.0 if you disagree.

People

Assignee: Adar Dembo
Reporter: Adar Dembo
Votes: 0
Watchers: 3

Dates

Created: 13/Feb/16 02:49
Updated: 17/Feb/16 03:00
Resolved: 17/Feb/16 03:00