Mesos / MESOS-2014

Master aborts with "Recovery failed: Failed to recover registrar: Failed to perform fetch within 5mins"


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: master
    • Labels: None
    • Environment: CentOS 6.3
      3.10.5-12.1.x86_64 #1 SMP Fri Aug 16 01:42:38 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

    Description

      I set up a Mesos master cluster with 3 nodes. At first everything worked well, but when the leading master died, the other candidate nodes could not recover and elect a new leader, and all of the candidate nodes died too.

      I1030 15:01:32.005691 6741 detector.cpp:138] Detected a new leader: (id='16')
      I1030 15:01:32.005692 6737 network.hpp:423] ZooKeeper group memberships changed
      I1030 15:01:32.006089 6741 group.cpp:658] Trying to get '/mesos/info_0000000016' in ZooKeeper
      I1030 15:01:32.006222 6738 group.cpp:658] Trying to get '/mesos/log_replicas/0000000015' in ZooKeeper
      I1030 15:01:32.007230 6738 group.cpp:658] Trying to get '/mesos/log_replicas/0000000016' in ZooKeeper
      I1030 15:01:32.007268 6736 detector.cpp:426] A new leading master (UPID=master@10.99.169.5:5050) is detected
      I1030 15:01:32.007546 6742 master.cpp:1196] The newly elected leader is master@10.99.169.5:5050 with id 20141030-150042-94987018-5050-6735
      I1030 15:01:32.007640 6742 master.cpp:1209] Elected as the leading master!
      I1030 15:01:32.007730 6742 master.cpp:1027] Recovering from registrar
      I1030 15:01:32.007895 6736 registrar.cpp:313] Recovering registrar
      I1030 15:01:32.008388 6742 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.99.169.5:5050, log-replica(1)@10.99.169.6:5050 }

      I1030 15:01:32.051316 6742 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:32.889194 6738 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:33.469511 6743 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:34.324684 6740 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:35.263629 6736 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:36.212492 6739 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:37.015682 6742 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:37.781746 6743 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:38.494547 6737 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:39.186830 6740 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:40.072258 6736 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:40.855337 6743 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:41.516916 6739 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:41.556437 6744 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
      I1030 15:01:41.557253 6741 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:41.557502 6739 recover.cpp:188] Received a recover response from a replica in EMPTY status
      I1030 15:01:41.558156 6741 recover.cpp:188] Received a recover response from a replica in EMPTY status
      I1030 15:01:42.153370 6737 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:42.505698 6742 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I1030 15:01:42.506060 6738 recover.cpp:188] Received a recover response from a replica in EMPTY status
      I1030 15:01:42.507046 6742 recover.cpp:188] Received a recover response from a replica in EMPTY status
      ......
      F1030 15:06:32.009464 6741 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 5mins
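The log shows a bounded retry pattern: each round of the replicated-log recover protocol gets 10 seconds ("Unable to finish the recover protocol in 10secs, retrying"), and the registrar gives up 5 minutes after "Recovering registrar" (15:01:32 to 15:06:32). Below is a minimal C++ sketch of that pattern under simulated time, assuming a hypothetical `try_round` callback in place of the real protocol. `recover_with_deadline` and `RecoverOutcome` are illustrative names, not Mesos code, and the sketch does not explain why every round failed in this report.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>

using Seconds = std::chrono::seconds;

// Result of the hypothetical bounded recovery loop.
struct RecoverOutcome {
  bool recovered;       // true if some round completed before the deadline
  int rounds;           // number of rounds attempted
  Seconds elapsed;      // simulated time consumed by failed rounds
  std::string message;  // failure text when not recovered
};

// Hypothetical model (not Mesos code): each round either completes or
// times out after `round_timeout`; rounds are retried until another full
// round would exceed `deadline`. Time is simulated, so nothing sleeps.
RecoverOutcome recover_with_deadline(const std::function<bool(int)>& try_round,
                                     Seconds round_timeout, Seconds deadline) {
  Seconds elapsed{0};
  int rounds = 0;
  while (elapsed + round_timeout <= deadline) {
    ++rounds;
    if (try_round(rounds)) {
      // Success: report the time spent on the earlier, failed rounds.
      return {true, rounds, elapsed, ""};
    }
    elapsed += round_timeout;  // round timed out: "retrying"
  }
  const auto minutes =
      std::chrono::duration_cast<std::chrono::minutes>(deadline).count();
  return {false, rounds, elapsed,
          "Failed to perform fetch within " + std::to_string(minutes) + "mins"};
}
```

With a 10 s round and a 300 s deadline, a `try_round` that never succeeds makes 30 attempts and then returns the same failure text that appears in the fatal log line.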

      Core dump info:
      #0 0x0000003d636328a5 in raise () from /lib64/libc.so.6
      #1 0x0000003d63634085 in abort () from /lib64/libc.so.6
      #2 0x00007f7a452f0e19 in google::DumpStackTraceAndExit () at src/utilities.cc:147
      #3 0x00007f7a452e7d5d in google::LogMessage::Fail () at src/logging.cc:1458
      #4 0x00007f7a452ebd77 in google::LogMessage::SendToLog (this=0x7f7a41d8f9d0) at src/logging.cc:1412
      #5 0x00007f7a452e9bf9 in google::LogMessage::Flush (this=0x7f7a41d8f9d0) at src/logging.cc:1281
      #6 0x00007f7a452e9efd in google::LogMessageFatal::~LogMessageFatal (this=0x7f7a41d8f9d0, __in_chrg=<value optimized out>) at src/logging.cc:1984
      #7 0x00007f7a44d6759c in mesos::internal::master::fail (message="Recovery failed", failure="Failed to recover registrar: Failed to perform fetch within 5mins") at ../../src/master/master.cpp:1016
      #8 0x00007f7a44da75a6 in _call<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, 0, 1> (_functor=<value optimized out>, __args#0=
      "Failed to recover registrar: Failed to perform fetch within 5mins") at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
      #9 operator()<const std::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__functor=<value optimized out>, __args#0="Failed to recover registrar: Failed to perform fetch within 5mins")
      at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
      #10 std::tr1::Function_handler<void(const std::string&), std::tr1::_Bind<void (*(const char*, std::tr1::_Placeholder<1>))(const std::string&, const std::string&)> >::_M_invoke(const std::tr1::_Any_data &, const std::string &) (_functor=<value optimized out>, __args#0="Failed to recover registrar: Failed to perform fetch within 5mins")
      at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
      #11 0x00007f7a44caff3c in process::Future<Nothing>::fail (this=0x7f7a140164f8, _message=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1628
      #12 0x00007f7a44de1a6a in fail (promise=std::tr1::shared_ptr (count 1) 0x7f7a140164f0, f=..., future=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:789
      #13 process::internal::thenf<mesos::internal::Registry, Nothing>(const std::tr1::shared_ptr<process::Promise<Nothing> > &, const std::tr1::function<process::Future<Nothing>(const mesos::internal::Registry&)> &, const process::Future<mesos::internal::Registry> &) (promise=std::tr1::shared_ptr (count 1) 0x7f7a140164f0, f=..., future=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1438
      #14 0x00007f7a44e18ffc in process::Future<mesos::internal::Registry>::fail (this=0x7f7a2800be68, _message=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1634
      #15 0x00007f7a44e18f9c in process::Future<mesos::internal::Registry>::fail (this=0x7f7a2801c488, _message=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1628
      #16 0x00007f7a44e0cf4c in fail (this=0x2179b80, info=<value optimized out>, recovery=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:789
      #17 mesos::internal::master::RegistrarProcess::_recover (this=0x2179b80, info=<value optimized out>, recovery=<value optimized out>) at ../../src/master/registrar.cpp:341
      #18 0x00007f7a44e24181 in _call<process::ProcessBase*&, 0, 1> (_functor=<value optimized out>, __args#0=<value optimized out>)
      at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
      #19 operator()<process::ProcessBase*> (__functor=<value optimized out>, __args#0=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
      #20 std::tr1::Function_handler<void(process::ProcessBase*), std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)> >)> >::_M_invoke(const std::tr1::_Any_data &, process::ProcessBase *) (_functor=<value optimized out>,
      __args#0=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
      #21 0x00007f7a452814f4 in process::ProcessManager::resume (this=0x214b690, process=0x2179e28) at ../../../3rdparty/libprocess/src/process.cpp:2848
      #22 0x00007f7a45281dec in process::schedule (arg=<value optimized out>) at ../../../3rdparty/libprocess/src/process.cpp:1479
      #23 0x0000003d63a07851 in start_thread () from /lib64/libpthread.so.0
      #24 0x0000003d636e811d in clone () from /lib64/libc.so.6
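The backtrace reads bottom-up: `RegistrarProcess::_recover` (frame #17) fails a `Future<mesos::internal::Registry>`, the failure passes through `thenf` into a `Future<Nothing>` (frames #15 to #11), and a callback bound to `mesos::internal::master::fail` (frame #7) turns it into the fatal log line, after which glog aborts the process (frames #6 to #0). The sketch below models only that propagation with a hypothetical `FailableFuture` type; it is not the libprocess `Future` API, and where the real handler logs at FATAL and aborts, this one only builds the message.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified future (NOT the libprocess API): it can
// only fail, and callbacks registered with on_failed() run when it does.
class FailableFuture {
 public:
  void on_failed(std::function<void(const std::string&)> callback) {
    if (failure_) {
      callback(*failure_);  // already failed: run the callback immediately
    } else {
      callbacks_.push_back(std::move(callback));
    }
  }

  void fail(const std::string& message) {
    if (failure_) return;  // a future fails at most once
    failure_ = message;
    for (const auto& callback : callbacks_) callback(message);
    callbacks_.clear();
  }

 private:
  std::optional<std::string> failure_;
  std::vector<std::function<void(const std::string&)>> callbacks_;
};

// Chaining: the downstream future fails with the upstream failure message,
// which is how the same text travels unchanged through frames #15 to #11.
std::shared_ptr<FailableFuture> then(const std::shared_ptr<FailableFuture>& upstream) {
  auto downstream = std::make_shared<FailableFuture>();
  upstream->on_failed(
      [downstream](const std::string& message) { downstream->fail(message); });
  return downstream;
}

// Stand-in for the master's failure handler (frame #7): joins the two
// strings the way the fatal log line does, without logging or aborting.
std::string fatal_line(const std::string& message, const std::string& failure) {
  return message + ": " + failure;
}
```

Failing the upstream future once yields exactly one "Recovery failed: ..." line; a second failure is ignored because a future settles only once.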

          People

            Assignee: Unassigned
            Reporter: Ji Huang (jesson)
            Votes: 0
            Watchers: 3