Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-465

Failures due to ZooKeeper operation timeouts in the master detector.

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels:

      Description

      Observed this in production when ZK connectivity issues were present:

      I0508 19:15:48.485126 18915 slave.cpp:951] Process exited: @0.0.0.0:0
      W0508 19:15:48.485252 18915 slave.cpp:954] WARNING! Master disconnected! Waiting for a new master to be elected.
      2013-05-08 19:15:51,621:18889(0x4ddeb940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [10.40.93.122:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 3ms)
      2013-05-08 19:15:51,621:18889(0x4ddeb940):ZOO_INFO@check_events@1585: initiated connection to server [10.40.89.122:2181]
      2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [10.40.89.122:2181] zk retcode=-112, errno=116(Stale NFS file handle): sessionId=0x33e8587f8b50141 has expired.
      2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
      2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_DEBUG@do_io@317: IO thread terminated
      2013-05-08 19:15:51,623:18889(0x4d5ea940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
      W0508 19:15:51.623631 18904 detector.cpp:393] Master detector (slave(1)@IP:PORT) ZooKeeper session expired!
      2013-05-08 19:15:51,623:18889(0x4d5ea940):ZOO_DEBUG@do_completion@336: completion thread terminated
      2013-05-08 19:15:51,623:18889(0x46ddd940):ZOO_INFO@zookeeper_close@2321: Freeing zookeeper resources for sessionId=0x33e8587f8b50141

      2013-05-08 19:15:51,625:18889(0x46ddd940):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@662: Client environment:host.name=HOSTNAME
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@669: Client environment:os.name=Linux
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@670: Client environment:os.arch=2.6.39-t7.el5
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Tue Oct 23 23:59:44 UTC 2012
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@679: Client environment:user.name=(null)
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@687: Client environment:user.home=/root
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@699: Client environment:user.dir=/
      2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=ZKHOST:2181 sessionTimeout=10000 watcher=0x7f6da70adc00 sessionId=0 sessionPasswd
      =<null> context=0x7f6d9c186640 flags=0
      2013-05-08 19:15:51,629:18889(0x46ddd940):ZOO_DEBUG@start_threads@152: starting threads...
      2013-05-08 19:15:51,629:18889(0x4d5ea940):ZOO_DEBUG@do_io@279: started IO thread
      2013-05-08 19:15:51,629:18889(0x4ddeb940):ZOO_DEBUG@do_completion@326: started completion thread
      2013-05-08 19:15:51,630:18889(0x4d5ea940):ZOO_INFO@check_events@1585: initiated connection to server [IP:2181]
      2013-05-08 19:15:51,632:18889(0x4d5ea940):ZOO_INFO@check_events@1632: session establishment complete on server [ZK_IP:2181], sessionId=0x3e85920347000b, negotiated timeout=10000
      2013-05-08 19:15:51,632:18889(0x4d5ea940):ZOO_DEBUG@check_events@1638: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_CONNECTED_STATE
      2013-05-08 19:15:51,632:18889(0x4ddeb940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
      I0508 19:15:51.632997 18908 detector.cpp:283] Master detector (slave(1)@IP:5051) connected to ZooKeeper ...
      I0508 19:15:51.633085 18908 detector.cpp:288] Authenticating to ZooKeeper using scheme 'digest'
      2013-05-08 19:15:51,633:18889(0x48de1940):ZOO_DEBUG@send_last_auth_info@1265: Sending auth info request to ZK_IP:2181
      I0508 19:15:53.453980 18913 http.cpp:280] HTTP request for '/slave(1)/stats.json'
      2013-05-08 19:15:58,302:18889(0x4d5ea940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [ZK_IP:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 4ms)
      2013-05-08 19:15:58,302:18889(0x4d5ea940):ZOO_DEBUG@handle_error@1141: Calling a watcher for a ZOO_SESSION_EVENT and the state=CONNECTING_STATE
      2013-05-08 19:15:58,302:18889(0x4ddeb940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
      F0508 19:15:58.303388 18908 detector.cpp:291] Failed to authenticate with ZooKeeper: operation timeout

          • Check failure stack trace: ***
            @ 0x7f6da732669d google::LogMessage::Fail()
            @ 0x7f6da732c307 google::LogMessage::SendToLog()
            @ 0x7f6da7327f4c google::LogMessage::Flush()
            @ 0x7f6da73281b6 google::LogMessageFatal::~LogMessageFatal()
            @ 0x7f6da70762e0 mesos::internal::ZooKeeperMasterDetectorProcess::connected()
            @ 0x7f6da7077298 std::tr1::_Function_handler<>::_M_invoke()
            @ 0x7f6da7078e14 std::tr1::_Function_handler<>::_M_invoke()
            @ 0x7f6da7221d7b std::tr1::function<>::operator()()
            @ 0x7f6da71daf9f process::ProcessBase::visit()
            @ 0x7f6da71efec8 process::DispatchEvent::visit()
            @ 0x7f6da71e411d process::ProcessManager::resume()
            @ 0x7f6da71e4968 process::schedule()
            @ 0x7f6da68fd73d start_thread
            @ 0x7f6da52e1f6d clone

      Here's another case where we don't handle operation timeout (this was on 0.12.0 but we have a similar issue on 0.13.0 when checking for existence of a node, see: MESOS-409):
      F0720 14:29:33.791141 28951 detector.cpp:315] Failed to create '/home/mesos/prod/master' in ZooKeeper: operation timeout

          • Check failure stack trace: ***
            2013-07-20 14:29:33,792:28924(0x7ff109727940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [10.35.97.137:2181] zk retcode=-112, errno=116(Stale NFS file handle): s
            essionId=0x23f2f2c0d7eb014 has expired.
            2013-07-20 14:29:33,792:28924(0x7ff109727940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
            2013-07-20 14:29:33,793:28924(0x7ff109727940):ZOO_DEBUG@do_io@317: IO thread terminated
            2013-07-20 14:29:33,793:28924(0x7ff108f26940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
            @ 0x7ff11691969d google::LogMessage::Fail()
            @ 0x7ff11691f307 google::LogMessage::SendToLog()
            @ 0x7ff11691af4c google::LogMessage::Flush()
            @ 0x7ff11691b1b6 google::LogMessageFatal::~LogMessageFatal()
            @ 0x7ff1166696ad mesos::internal::ZooKeeperMasterDetectorProcess::connected()
            @ 0x7ff11666a298 std::tr1::_Function_handler<>::_M_invoke()
            @ 0x7ff11666be14 std::tr1::_Function_handler<>::_M_invoke()
            @ 0x7ff116814d7b std::tr1::function<>::operator()()
            @ 0x7ff1167cdf9f process::ProcessBase::visit()
            @ 0x7ff1167e2ec8 process::DispatchEvent::visit()
            @ 0x7ff1167d711d process::ProcessManager::resume()
            @ 0x7ff1167d7968 process::schedule()
            @ 0x7ff115eef83d start_thread
            @ 0x7ff1148d1f8d clone

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                xujyan Yan Xu
                Reporter:
                bmahler Benjamin Mahler
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: