Description
Observed this in production when ZK connectivity issues were present:
I0508 19:15:48.485126 18915 slave.cpp:951] Process exited: @0.0.0.0:0
W0508 19:15:48.485252 18915 slave.cpp:954] WARNING! Master disconnected! Waiting for a new master to be elected.
2013-05-08 19:15:51,621:18889(0x4ddeb940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [10.40.93.122:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 3ms)
2013-05-08 19:15:51,621:18889(0x4ddeb940):ZOO_INFO@check_events@1585: initiated connection to server [10.40.89.122:2181]
2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [10.40.89.122:2181] zk retcode=-112, errno=116(Stale NFS file handle): sessionId=0x33e8587f8b50141 has expired.
2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_DEBUG@do_io@317: IO thread terminated
2013-05-08 19:15:51,623:18889(0x4d5ea940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
W0508 19:15:51.623631 18904 detector.cpp:393] Master detector (slave(1)@IP:PORT) ZooKeeper session expired!
2013-05-08 19:15:51,623:18889(0x4d5ea940):ZOO_DEBUG@do_completion@336: completion thread terminated
2013-05-08 19:15:51,623:18889(0x46ddd940):ZOO_INFO@zookeeper_close@2321: Freeing zookeeper resources for sessionId=0x33e8587f8b50141
2013-05-08 19:15:51,625:18889(0x46ddd940):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@662: Client environment:host.name=HOSTNAME
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@670: Client environment:os.arch=2.6.39-t7.el5
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Tue Oct 23 23:59:44 UTC 2012
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@679: Client environment:user.name=(null)
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@687: Client environment:user.home=/root
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@699: Client environment:user.dir=/
2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=ZKHOST:2181 sessionTimeout=10000 watcher=0x7f6da70adc00 sessionId=0 sessionPasswd
=<null> context=0x7f6d9c186640 flags=0
2013-05-08 19:15:51,629:18889(0x46ddd940):ZOO_DEBUG@start_threads@152: starting threads...
2013-05-08 19:15:51,629:18889(0x4d5ea940):ZOO_DEBUG@do_io@279: started IO thread
2013-05-08 19:15:51,629:18889(0x4ddeb940):ZOO_DEBUG@do_completion@326: started completion thread
2013-05-08 19:15:51,630:18889(0x4d5ea940):ZOO_INFO@check_events@1585: initiated connection to server [IP:2181]
2013-05-08 19:15:51,632:18889(0x4d5ea940):ZOO_INFO@check_events@1632: session establishment complete on server [ZK_IP:2181], sessionId=0x3e85920347000b, negotiated timeout=10000
2013-05-08 19:15:51,632:18889(0x4d5ea940):ZOO_DEBUG@check_events@1638: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_CONNECTED_STATE
2013-05-08 19:15:51,632:18889(0x4ddeb940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0508 19:15:51.632997 18908 detector.cpp:283] Master detector (slave(1)@IP:5051) connected to ZooKeeper ...
I0508 19:15:51.633085 18908 detector.cpp:288] Authenticating to ZooKeeper using scheme 'digest'
2013-05-08 19:15:51,633:18889(0x48de1940):ZOO_DEBUG@send_last_auth_info@1265: Sending auth info request to ZK_IP:2181
I0508 19:15:53.453980 18913 http.cpp:280] HTTP request for '/slave(1)/stats.json'
2013-05-08 19:15:58,302:18889(0x4d5ea940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [ZK_IP:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 4ms)
2013-05-08 19:15:58,302:18889(0x4d5ea940):ZOO_DEBUG@handle_error@1141: Calling a watcher for a ZOO_SESSION_EVENT and the state=CONNECTING_STATE
2013-05-08 19:15:58,302:18889(0x4ddeb940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
F0508 19:15:58.303388 18908 detector.cpp:291] Failed to authenticate with ZooKeeper: operation timeout
-
-
- Check failure stack trace: ***
@ 0x7f6da732669d google::LogMessage::Fail()
@ 0x7f6da732c307 google::LogMessage::SendToLog()
@ 0x7f6da7327f4c google::LogMessage::Flush()
@ 0x7f6da73281b6 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f6da70762e0 mesos::internal::ZooKeeperMasterDetectorProcess::connected()
@ 0x7f6da7077298 std::tr1::_Function_handler<>::_M_invoke()
@ 0x7f6da7078e14 std::tr1::_Function_handler<>::_M_invoke()
@ 0x7f6da7221d7b std::tr1::function<>::operator()()
@ 0x7f6da71daf9f process::ProcessBase::visit()
@ 0x7f6da71efec8 process::DispatchEvent::visit()
@ 0x7f6da71e411d process::ProcessManager::resume()
@ 0x7f6da71e4968 process::schedule()
@ 0x7f6da68fd73d start_thread
@ 0x7f6da52e1f6d clone
- Check failure stack trace: ***
-
Here's another case where we don't handle operation timeout (this was on 0.12.0 but we have a similar issue on 0.13.0 when checking for existence of a node, see: MESOS-409):
F0720 14:29:33.791141 28951 detector.cpp:315] Failed to create '/home/mesos/prod/master' in ZooKeeper: operation timeout
-
-
- Check failure stack trace: ***
2013-07-20 14:29:33,792:28924(0x7ff109727940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [10.35.97.137:2181] zk retcode=-112, errno=116(Stale NFS file handle): s
essionId=0x23f2f2c0d7eb014 has expired.
2013-07-20 14:29:33,792:28924(0x7ff109727940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
2013-07-20 14:29:33,793:28924(0x7ff109727940):ZOO_DEBUG@do_io@317: IO thread terminated
2013-07-20 14:29:33,793:28924(0x7ff108f26940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
@ 0x7ff11691969d google::LogMessage::Fail()
@ 0x7ff11691f307 google::LogMessage::SendToLog()
@ 0x7ff11691af4c google::LogMessage::Flush()
@ 0x7ff11691b1b6 google::LogMessageFatal::~LogMessageFatal()
@ 0x7ff1166696ad mesos::internal::ZooKeeperMasterDetectorProcess::connected()
@ 0x7ff11666a298 std::tr1::_Function_handler<>::_M_invoke()
@ 0x7ff11666be14 std::tr1::_Function_handler<>::_M_invoke()
@ 0x7ff116814d7b std::tr1::function<>::operator()()
@ 0x7ff1167cdf9f process::ProcessBase::visit()
@ 0x7ff1167e2ec8 process::DispatchEvent::visit()
@ 0x7ff1167d711d process::ProcessManager::resume()
@ 0x7ff1167d7968 process::schedule()
@ 0x7ff115eef83d start_thread
@ 0x7ff1148d1f8d clone
- Check failure stack trace: ***
-
Attachments
Issue Links
- relates to
-
MESOS-814 Retry retryable authentication failures
- Resolved