Mesos / MESOS-8703

Mesos master can't reconnect to ZooKeeper


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: master
    • Labels: None

    Description

      The Mesos master can't reconnect to ZooKeeper after ZooKeeper hangs. Master log:

      2018-03-20 10:16:45,608:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1666: Socket [<zknode1>:2181] zk retcode=-7, errno=110(Connection timed out): connection to <zknode1>:2181 timed out (exceeded timeout by 3ms)
      2018-03-20 10:16:45,609:1(0x2ae675db6700):ZOO_INFO@check_events@1728: initiated connection to server [<zknode2>:2181]
      2018-03-20 10:16:45,619:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1764: Socket [<zknode2>:2181] zk retcode=-112, errno=116(Stale file handle): sessionId=0x5623d0e483dd435 has expired.
      I0320 10:16:45.620604    18 group.cpp:511] ZooKeeper session expired
      I0320 10:16:45.620802    16 detector.cpp:152] Detected a new leader: None
      I0320 10:16:45.620957    16 master.cpp:2176] The newly elected leader is None
      mesos-master: ../../3rdparty/stout/include/stout/option.hpp:112: T& Option<T>::get() & [with T = mesos::MasterInfo]: Assertion `isSome()' failed.
      *** Aborted at 1521541005 (unix time) try "date -d @1521541005" if you are using GNU date ***
      PC: @     0x2ae63d2b9428 (unknown)
      *** SIGABRT (@0x1) received by PID 1 (TID 0x2ae648ffa700) from PID 1; stack trace: ***
          @     0x2ae63d078390 (unknown)
          @     0x2ae63d2b9428 (unknown)
          @     0x2ae63d2bb02a (unknown)
          @     0x2ae63d2b1bd7 (unknown)
          @     0x2ae63d2b1c82 (unknown)
      2018-03-20 10:16:45,622:1(0x2ae649ffc700):ZOO_INFO@zookeeper_close@2543: Freeing zookeeper resources for sessionId=0x5623d0e483dd435
      
      2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
      2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@730: Client environment:host.name=<mesos_hostname>
      2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
      2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@738: Client environment:os.arch=4.8.15-1.el7.wg.x86_64
      2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Mon Dec 26 14:34:45 UTC 2016
      2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
      2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@755: Client environment:user.home=/root
      2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@767: Client environment:user.dir=/
      2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=<zk_pool> sessionTimeout=10000 watcher=0x2ae63b3711e0 sessionId=0 sessionPasswd=<null> context=0x2ae6900036f8 flags=0
          @     0x2ae63ad6b55b mesos::internal::master::Master::detected()
          @     0x2ae63b9e4cfc process::ProcessBase::visit()
      2018-03-20 10:16:45,634:1(0x2ae6765b7700):ZOO_INFO@check_events@1728: initiated connection to server [<zknode1>:2181]
          @     0x2ae63b9fac84 process::ProcessManager::resume()
          @     0x2ae63b9fd5e6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
          @     0x2ae63c87ec80 (unknown)
          @     0x2ae63d06e6ba start_thread
          @     0x2ae63d38b3dd (unknown)
      2018-03-20 10:16:45,651:1(0x2ae6765b7700):ZOO_INFO@check_events@1775: session establishment complete on server [<zknode1>:2181], sessionId=0x1623f43348692c7, negotiated timeout=10000
      I0320 10:16:45.651684    15 group.cpp:341] Group process (zookeeper-group(2)@<mesos4>:5050) connected to ZooKeeper
      I0320 10:16:45.651733    15 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      I0320 10:16:45.651743    15 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
      I0320 10:16:45.676736    15 detector.cpp:152] Detected a new leader: (id='704')
      I0320 10:16:45.676844    15 group.cpp:700] Trying to get '/mesos/json.info_0000000704' in ZooKeeper
      I0320 10:16:45.683346    15 zookeeper.cpp:262] A new leading master (UPID=master@<mesos4>:5050) is detected
      

      After this, the Mesos master does not respond to HTTP requests, and no leader election happens.
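The crash point in the log is the assertion on `Option<T>::get()` with `T = mesos::MasterInfo`: after the session expired, the detector reported the leader as None, and the stack trace shows `Master::detected()` then calling `get()` on an empty `Option<MasterInfo>`. The sketch below illustrates that failure mode in general terms only. It uses C++17 `std::optional` as a stand-in for stout's `Option`, and `MasterInfoSketch`, `describeLeader` and `isElected` are hypothetical names; it is not the Mesos code path or its actual fix.

```cpp
// Hypothetical sketch (not Mesos code): std::optional stands in for stout's
// Option<T>, and MasterInfoSketch stands in for mesos::MasterInfo.
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, pared-down leader record; only the field this sketch needs.
struct MasterInfoSketch {
  std::string pid;  // e.g. "master@host:5050"
};

// Hypothetical helper. Reading leader->pid without the has_value() check is
// the analogue of the crash in the log: stout's Option<T>::get() asserts
// isSome() and aborts, while std::optional's operator-> on an empty value is
// undefined behaviour.
std::string describeLeader(const std::optional<MasterInfoSketch>& leader) {
  if (!leader.has_value()) {
    return "None";  // mirrors the "newly elected leader is None" log wording
  }
  return leader->pid;
}

// Hypothetical helper: a master counts as elected only if a leader exists
// and that leader is this master.
bool isElected(const std::optional<MasterInfoSketch>& leader,
               const std::string& self) {
  return leader.has_value() && leader->pid == self;
}
```

With an empty leader, `describeLeader` returns "None" and `isElected` returns false instead of dereferencing an empty value.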


People

    • Assignee: Unassigned
    • Reporter: Anton Malevich (Lomonosow)
    • Votes: 0
    • Watchers: 4
