KUDU-2819

SIGSEGV during kudu cluster rebalance

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: 1.9.1, 1.10.0
    • Component/s: None
    • Labels: None

    Description

      The Kudu rebalancer tool consistently crashes with a segmentation fault at run time.

      The following output appears on the client running the rebalancer command:

      *** Aborted at 1556920300 (unix time) try "date -d @1556920300" if you are using GNU date ***
      PC: @ 0x2972aec tc_new
      *** SIGSEGV (@0x0) received by PID 62640 (TID 0x7f5f7191b980) from PID 0; stack trace: ***
          @ 0x369b00f7e0 (unknown)
          @ 0x2972aec tc_new
          @ 0xc6a077 kudu::client::KuduClient::Data::GetTableSchema()
          @ 0xc56e0d kudu::client::KuduClient::OpenTable()
          @ 0xc38228 kudu::tools::RemoteKsckCluster::RetrieveTablesList()
          @ 0xc2953a kudu::tools::KsckCluster::FetchTableAndTabletInfo()
          @ 0xc217c4 kudu::tools::Ksck::FetchTableAndTabletInfo()
          @ 0xdad2c1 kudu::tools::DoKsckForTablet()
          @ 0xdaf244 kudu::tools::CheckCompleteMove()
          @ 0xd84c18 kudu::tools::Rebalancer::AlgoBasedRunner::UpdateMovesInProgressStatus()
          @ 0xd816f4 kudu::tools::Rebalancer::RunWith()
          @ 0xd8dac6 kudu::tools::Rebalancer::Run()
          @ 0xb34011 (unknown)
          @ 0xb353a4 std::_Function_handler<>::_M_invoke()
          @ 0x10b7eda kudu::tools::Action::Run()
          @ 0xbb4f04 kudu::tools::DispatchCommand()
          @ 0xbb56d3 kudu::tools::RunTool()
          @ 0xad6778 main
          @ 0x369ac1ed1d __libc_start_main
          @ 0xb2ed7d (unknown)
      Segmentation fault (core dumped)

       

      A backtrace generated from the core dump shows the crash occurring inside gperftools (tcmalloc):

      #0 SLL_Next (t=0x59c18bbfeed6371)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/linked_list.h:45
      #1 SLL_TryPop (rv=<synthetic pointer>, list=0x58d4d60)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/linked_list.h:69
      #2 TryPop (rv=<synthetic pointer>, this=0x58d4d60)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/thread_cache.h:220
      #3 Allocate (oom_handler=0x29711c0 <tcmalloc::cpp_throw_oom(unsigned long)>, cl=9, size=128, this=<optimized out>)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/thread_cache.h:379
      #4 malloc_fast_path<tcmalloc::cpp_throw_oom> (size=<optimized out>)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1848
      #5 tc_new (size=<optimized out>) at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1969
      #6 0x0000000000c6a077 in allocate (__n=1, this=<synthetic pointer>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/ext/new_allocator.h:104
      #7 allocate (__a=<synthetic pointer>, __n=1) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/alloc_traits.h:357
      #8 __shared_count<kudu::Synchronizer::Data, std::allocator<kudu::Synchronizer::Data> > (__a=..., this=0x7fff13bcbde8)
      at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:616
      #9 __shared_ptr<std::allocator<kudu::Synchronizer::Data> > (__a=..., __tag=..., this=0x7fff13bcbde0)
      at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:1090
      #10 shared_ptr<std::allocator<kudu::Synchronizer::Data> > (__a=..., __tag=..., this=0x7fff13bcbde0)
      at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:316
      #11 allocate_shared<kudu::Synchronizer::Data, std::allocator<kudu::Synchronizer::Data> > (__a=...)
      at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:588
      #12 make_shared<kudu::Synchronizer::Data> () at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:604
      #13 Synchronizer (this=0x7fff13bcbde0) at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/util/async_util.h:47
      #14 kudu::client::KuduClient::Data::GetTableSchema (this=<optimized out>, client=client@entry=0x11fe5440, table_name="impala::database.some_table",
      deadline=..., schema=schema@entry=0x7fff13bcc070, partition_schema=0x7fff13bcc0c0, table_id=0x7fff13bcc080, num_replicas=0x7fff13bcc068)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/client/client-internal.cc:441
      #15 0x0000000000c56e0d in kudu::client::KuduClient::OpenTable (this=0x11fe5440, table_name="impala::database.some_table", table=table@entry=0x7fff13bcc180)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/client/client.cc:513
      #16 0x0000000000c38228 in kudu::tools::RemoteKsckCluster::RetrieveTablesList (this=0x607d680)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/ksck_remote.cc:502
      #17 0x0000000000c2953a in kudu::tools::KsckCluster::FetchTableAndTabletInfo (this=0x607d680)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/ksck.h:408
      #18 0x0000000000c217c4 in kudu::tools::Ksck::FetchTableAndTabletInfo (this=this@entry=0x7fff13bcc510)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/ksck.cc:302
      #19 0x0000000000dad2c1 in kudu::tools::DoKsckForTablet (master_addresses=std::vector of length 3, capacity 3 = {...}, tablet_id="00229fcb55dc4a348e8caae7f7a3fc41")
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_replica_util.cc:624
      #20 0x0000000000daf244 in kudu::tools::CheckCompleteMove (master_addresses=std::vector of length 3, capacity 3 = {...},
      client=std::tr1::shared_ptr (count 1) 0x103345a0, tablet_id="00229fcb55dc4a348e8caae7f7a3fc41", from_ts_uuid="05d76878409e448fba542fade206dd15",
      to_ts_uuid="26d44b84ff3645d18f03b05a816e21eb", is_complete=0x7fff13bccb4f, completion_status=0x7fff13bccb50)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_replica_util.cc:319
      #21 0x0000000000d84c18 in kudu::tools::Rebalancer::AlgoBasedRunner::UpdateMovesInProgressStatus (this=0x7fff13bcd090, has_errors=0x7fff13bccd40,
      timed_out=0x7fff13bcccdf) at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/rebalancer.cc:1173
      #22 0x0000000000d816f4 in kudu::tools::Rebalancer::RunWith (this=this@entry=0x7fff13bd2390, runner=runner@entry=0x7fff13bcd090,
      result_status=result_status@entry=0x7fff13bd20ec) at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/rebalancer.cc:912
      #23 0x0000000000d8dac6 in kudu::tools::Rebalancer::Run (this=this@entry=0x7fff13bd2390, result_status=result_status@entry=0x7fff13bd20ec,
      moves_count=moves_count@entry=0x7fff13bd21c8) at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/rebalancer.cc:203
      #24 0x0000000000b34011 in kudu::tools::(anonymous namespace)::RunRebalance (context=...)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_action_cluster.cc:319
      #25 0x0000000000b353a4 in std::_Function_handler<kudu::Status (kudu::tools::RunnerContext const&), kudu::Status (*)(kudu::tools::RunnerContext const&)>::_M_invoke(std::_Any_data const&, kudu::tools::RunnerContext const&) (__functor=..., __args#0=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2025
      #26 0x00000000010b7eda in operator() (__args#0=..., this=0x613a650) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2439
      Python Exception <class 'gdb.error'> There is no member or method named _M_element_count.:
      #27 kudu::tools::Action::Run (this=this@entry=0x613a630, chain=std::vector of length 2, capacity 2 = {...}, required_args=,
      variadic_args=std::vector of length 0, capacity 0)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_action.cc:258
      #28 0x0000000000bb4f04 in kudu::tools::DispatchCommand (chain=std::vector of length 2, capacity 2 = {...}, action=action@entry=0x613a630,
      remaining_args=std::deque with 1 elements = {...}) at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_main.cc:132
      #29 0x0000000000bb56d3 in kudu::tools::RunTool (argc=4, argv=0x7fff13bd2960, show_help=show_help@entry=false)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_main.cc:204
      #30 0x0000000000ad6778 in main (argc=4, argv=0x7fff13bd2960)
      at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_main.cc:265

       

      I don't see an obvious memory-mismanagement scenario, such as a double free or a use-after-free.
      I suspect either that memory was corrupted at some earlier point or that there is a bug in tcmalloc itself.
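
      To illustrate why earlier memory corruption could surface only later, inside an unrelated allocation such as the tc_new call above, here is a minimal sketch of an intrusive free list. It is a toy model, not tcmalloc's implementation: ToyPool, kSlotSize, kEnd, and DemonstrateLatentCorruption are hypothetical names invented for this example, and the pool detects a bad link instead of dereferencing it so that the demonstration stays well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical toy pool (NOT tcmalloc). Each free slot stores the index of
// the next free slot in its own first bytes, so the free list lives inside
// the memory it manages.
constexpr std::size_t kSlotSize = 16;
constexpr std::uint32_t kEnd = 0xFFFFFFFFu;  // end-of-list marker

class ToyPool {
 public:
  explicit ToyPool(std::uint32_t num_slots)
      : storage_(num_slots * kSlotSize), num_slots_(num_slots) {
    // Thread every slot onto the free list: 0 -> 1 -> ... -> kEnd.
    for (std::uint32_t i = 0; i < num_slots; ++i) {
      StoreNext(i, i + 1 < num_slots ? i + 1 : kEnd);
    }
    head_ = num_slots > 0 ? 0 : kEnd;
  }

  // Pops the head of the free list. Returns false when the list is empty or
  // when the head is not a valid slot index (a corrupted link).
  bool Allocate(std::uint32_t* slot) {
    if (head_ == kEnd) return false;
    if (head_ >= num_slots_) {
      corrupted_ = true;  // a real allocator would dereference this and crash
      return false;
    }
    *slot = head_;
    head_ = LoadNext(*slot);  // trusts whatever bytes are stored in the slot
    return true;
  }

  // Pushes a slot back onto the free list, writing the link into the slot.
  void Free(std::uint32_t slot) {
    assert(slot < num_slots_);
    StoreNext(slot, head_);
    head_ = slot;
  }

  unsigned char* Data(std::uint32_t slot) { return &storage_[slot * kSlotSize]; }
  bool corrupted() const { return corrupted_; }

 private:
  std::uint32_t LoadNext(std::uint32_t slot) const {
    std::uint32_t next;
    std::memcpy(&next, &storage_[slot * kSlotSize], sizeof(next));
    return next;
  }
  void StoreNext(std::uint32_t slot, std::uint32_t next) {
    std::memcpy(&storage_[slot * kSlotSize], &next, sizeof(next));
  }

  std::vector<unsigned char> storage_;
  std::uint32_t num_slots_;
  std::uint32_t head_ = kEnd;
  bool corrupted_ = false;
};

// Simulates a stale write to a freed slot. The allocation right after the
// bug still succeeds; the damage shows up one allocation later.
bool DemonstrateLatentCorruption() {
  ToyPool pool(4);
  std::uint32_t a = 0;
  if (!pool.Allocate(&a)) return false;
  pool.Free(a);
  // The bug: writing through a stale handle clobbers the stored link.
  std::memset(pool.Data(a), 0xAB, kSlotSize);
  std::uint32_t first = 0;
  std::uint32_t second = 0;
  bool first_ok = pool.Allocate(&first);    // succeeds, loads a garbage link
  bool second_ok = pool.Allocate(&second);  // trips over the garbage link
  return first_ok && !second_ok && pool.corrupted();
}
```

      In this sketch the faulty write and the failing allocation are separated by a successful allocation, which is one reason the stack at crash time may say little about where the corruption originated. The implausible pointer passed to SLL_Next in frame #0 would be consistent with this kind of damaged link, although the trace alone does not prove it.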

          People

            Assignee: aserbin Alexey Serbin
            Reporter: mbarnett Mitch Barnett