Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Tags: zookeeper outage postmortem

      Description

      [Moving a thread from the zookeeper-user mailing list]

      RECOVERY
      We eventually recovered from this situation by shutting down clients. Initially I tried restarting the ZooKeeper servers; however, they were getting hammered and I believe sessions were timing out. I shut down ~2k clients (a lightweight Python app that simply sets one data watch and takes an action when it changes), at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe it would have ever recovered without shedding load.

      QUORUM/ELECTIONS DURING EVENT
      Unfortunately I do not have logs from the event. We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again.

      During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out.

      GC PAUSE/LOGGING
      This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable GC logging and tune GC.

      CLIENT SESSION TIMEOUTS
      Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting.

      2010-07-01 05:14:00,260:44267(0x2af330240110):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=10000 watcher=(nil) sessionId=0 sessionPasswd=<null> context=(nil) flags=0
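
      For reference, here is a minimal sketch of how a client could request an explicit session timeout instead of relying on the default. This assumes the zkpython binding (which wraps the same C library); the host string and 30-second value are illustrative only, and the server may still clamp the negotiated timeout to its configured bounds.

      import zookeeper

      # Illustrative watcher; zkpython watchers receive (handle, type, state, path).
      def watcher(handle, event_type, state, path):
          pass  # react to session/watch events here

      # The third argument is the requested session timeout in milliseconds
      # (30s here instead of the 10s we were getting implicitly).
      handle = zookeeper.init("10.209.21.133:2181,10.209.21.175:2181", watcher, 30000)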

      CLIENT BACKOFFS
      Looking in application logs, we see lots of the following:

      2010-07-01 05:13:14,674:41491(0x41ebf940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms)

      Doing some simple aggregation we see 130 errors in a ~13 minute sample period. This behavior across thousands of clients effectively amounted to a DDoS against Zookeeper. Looking at this data, making exponential backoff the default behavior seems appropriate.

      Full text of the client errors is attached. I grepped "errno" from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same.
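
      For what it's worth, the aggregation was nothing fancy; a sketch along these lines (the file name matches the attachment, and the timestamp format matches the log lines above) reproduces the per-minute error counts:

      from collections import Counter

      # Count "errno" error lines per minute in the attached client log.
      # Timestamps look like "2010-07-01 05:13:14,674:...", so the first
      # 16 characters give the minute bucket.
      counts = Counter()
      with open("client-errors.txt") as f:
          for line in f:
              if "errno" in line:
                  counts[line[:16]] += 1

      for minute, n in sorted(counts.items()):
          print(minute, n)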

      GOING FORWARD
      A long GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball by just using the defaults; I don't think any changes should be made to address the lack of tuning.

      Exponential backoff does seem like a good idea, and generally useful for most people. There will always be service interruptions, and backing off would be a great preventive measure for getting out of a dogpile situation.
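
      To make that concrete, here is a minimal sketch of the kind of backoff loop I have in mind; connect_once() is a hypothetical stand-in for whatever a given binding uses to (re)establish a session:

      import random
      import time

      def connect_with_backoff(connect_once, base=1.0, cap=60.0):
          """Retry a failing connect with exponential backoff plus jitter.

          connect_once is a hypothetical callable that attempts a single
          connection and raises on failure; base/cap are in seconds.
          """
          delay = base
          while True:
              try:
                  return connect_once()
              except Exception:
                  # Sleep a random amount up to the current delay (jitter
                  # keeps thousands of clients from retrying in lockstep),
                  # then double the delay up to the cap.
                  time.sleep(random.uniform(0, delay))
                  delay = min(delay * 2, cap)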

      Patrick's message:
      """
      Hi Travis, as Flavio suggested would be great to get the logs. A few questions:

      1) how did you eventually recover, restart the zk servers?

      2) was the cluster losing quorum during this time? leader re-election?

      3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental?

      4) what are the clients using for timeout on the sessions?

      3.4 is probably not for a few months yet, but we are planning a 3.3.2 release in a few weeks to fix a couple of critical issues (which don't seem related to what you saw). If we can identify the problem here we should be able to include it in any fix release we do.

      Fixing something like ZOOKEEPER-517 might help, but it's not clear how we got into this state in the first place; fixing 517 might not have any effect if the root cause is not addressed. ZOOKEEPER-662 has only ever been reported once afaik, and we weren't able to identify the root cause for that one.

      One thing we might also consider is modifying the zk client lib to backoff connection attempts if they keep failing (timing out say). Today the clients are pretty aggressive on reconnection attempts. Having some sort of backoff (exponential?) would provide more breathing room to the server in this situation.

      Patrick
      """

      Flavio's message:
      """
      Hi Travis, Do you think it would be possible for you to open a jira and upload your logs?

      Thanks,
      -Flavio
      """

      My initial message:
      """
      Hey zookeepers -

      We just experienced a total zookeeper outage, and here's a quick
      post-mortem of the issue, and some questions about preventing it going
      forward. Quick overview of the setup:

      • RHEL5 2.6.18 kernel
      • Zookeeper 3.3.0
      • ulimit raised to 65k files
      • 3 cluster members
      • 4-5k connections in steady-state
      • Primarily C and python clients, plus some java

      In chronological order, the issue manifested itself as an alert about
      RW tests failing. Logs were full of "too many open files" errors, and
      the output of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets.
      CPU was at 100%. Application logs showed lots of connection timeouts.
      This suggests an event happened that caused applications to dogpile
      on Zookeeper, and eventually the CLOSE_WAIT timeout caused file
      handles to run out; basically game over.

      I looked through lots of logs (clients+servers) and did not see a
      clear indication of what happened. Graphs show a sudden decrease in
      network traffic when the outage began; zookeeper goes cpu bound and
      runs out of file descriptors.

      Clients are primarily a couple thousand C clients using default
      connection parameters, and a couple thousand python clients using
      default connection parameters.

      Digging through Jira we see two issues that probably contributed to this outage:

      https://issues.apache.org/jira/browse/ZOOKEEPER-662
      https://issues.apache.org/jira/browse/ZOOKEEPER-517

      Both are tagged for the 3.4.0 release. Anyone know if that's still the
      case, and when 3.4.0 is roughly scheduled to ship?

      Thanks!
      Travis
      """

      1. client-errors.txt (26 kB) - Travis Crawford
      2. zookeeper-out-of-files.netstat (2.42 MB) - Travis Crawford

        Issue Links

          • is related to ZOOKEEPER-517, ZOOKEEPER-803

          Activity

          Jeff Hammerbacher made changes -
          Link This issue is related to ZOOKEEPER-803 [ ZOOKEEPER-803 ]
          Travis Crawford added a comment -

          Here's the bugfix message:

          """
          commit 2b59b4d4b4f2db0de3b01daeed974754a4cfe9d5
          Author: marius a. eriksen <marius@twitter.com>
          Date: Thu Jul 1 18:16:36 2010 -0700

          Fix ZK client expiry: the code had a mixup of Expired vs. Disconnected.

          In terms of ZK:
          Disconnected = the client disconnected, but the library will reconnect
          Expired = the client disconnected, and you need to reconnect
          """

          Looking at the commit, there was a bug that did not rate limit reconnects when sessions expire. This is different from the client we're "blessing", and a big reason why we should just choose one and stick with it.
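
          For illustration, the distinction the commit message describes looks roughly like this in a zkpython-style watcher (the constants are from the zkpython binding; hosts and timeout are examples, not our real config):

          import zookeeper

          HOSTS = "10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181"

          def watcher(handle, event_type, state, path):
              if state == zookeeper.EXPIRED_SESSION_STATE:
                  # Expired: the session is gone for good; the application
                  # must build a new client and keep the new handle.
                  zookeeper.close(handle)
                  new_handle = zookeeper.init(HOSTS, watcher, 10000)
              elif state == zookeeper.CONNECTING_STATE:
                  # Disconnected: the library reconnects on its own; creating
                  # a new client here is exactly the bug that was fixed.
                  pass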

          Re: contrib, I'll float that with the team. I'm not sure what the change rate is; if it's slow, yeah, we should go with contrib. If it's still going to churn, we should hold off.

          Patrick Hunt added a comment -

          Ah, that's great to hear.

          I would really appreciate it though if you (Travis) could enter a new JIRA. Basically I consider it a serious problem that the server failed in this situation. Obviously fixing ZOOKEEPER-517 would help here, but really we need to ensure that misbehaving clients cannot bring down the service. That's a serious bug. Please enter a new jira; if you could describe what the problem was on the client side, that will help us craft a test to verify the fix.

          Btw, I did add a new wiki page to capture available bindings:
          http://wiki.apache.org/hadoop/ZooKeeper/ZKClientBindings

          However I encourage you (anyone) to submit back the code as a contrib.

          Regards.

          Mahadev konar added a comment -

          Travis, can you elaborate on the misbehaving client? What was its exact behavior? We just want to make sure whether it's indeed the misbehaving client, or whether we need to fix something on our side as well.

          Travis Crawford made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Assignee Travis Crawford [ traviscrawford ]
          Resolution Not A Problem [ 8 ]
          Travis Crawford added a comment -

          @Patrick - So this morning while investigating I discovered that yes, the client you pointed out was indeed misbehaving. It was a new application that had a bug.

          I talked with a couple teams and basically everyone is using a different Scala client. We've decided http://github.com/jcorwin/zookeeper-client is our best scala client and are forking from the main twitter account so everyone knows to use it.

          I looked around the docs and didn't see a place to list the scala client; please let me know if we should list this somewhere.

          Closing this issue as it's basically a client bug on our side. Thanks for the comments and help tracking this down.

          Patrick Hunt added a comment -

          Travis, one more thing:

          egrep 10.209.44.182 zookeeper-out-of-files.netstat |egrep EST |wc -l
          66
          

          10.209.44.182 has 66 connections established to this server. Why so many? Looks like a potential problem with the client-side code; should it be opening more than one session concurrently?
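
          As an aside, a quick way to spot other clients like this is to aggregate the netstat dump per remote address. A rough sketch, assuming standard netstat -an output where the foreign address is the fifth column:

          from collections import Counter

          # Count ESTABLISHED connections per remote IP in the netstat dump.
          counts = Counter()
          with open("zookeeper-out-of-files.netstat") as f:
              for line in f:
                  fields = line.split()
                  if len(fields) >= 6 and fields[5] == "ESTABLISHED":
                      counts[fields[4].rsplit(":", 1)[0]] += 1

          for ip, n in counts.most_common(10):
              print(ip, n)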

          Patrick Hunt added a comment -

          Travis, I see the address 10.209.44.182 2140 times in the netstat file you provided.

          Can you provide some insight on this client - which binding is it: C or Java? Python, Ruby... what? Also, if this client fails (say it segvs or exits for whatever reason), is it automatically restarted by a supervisor? Basically what I'm trying to figure out is why this client should be attempting so many connections to the server. The client libs we provide typically have a "hold off" of a second or so between connection attempts so that we don't overload the server in this way. However, if the client were being restarted repeatedly, or if there were a bug in the application where connections were established in a tight loop (I've seen a couple of client-side bugs like this now, where the client code thought it failed to get a connection and therefore created a new ZK client, causing a DoS on the server because thousands of clients were created by each client process; this is one of the reasons we added the maxClientCnxns check), or a bug in our client library, that could explain why we see so many TIME_WAIT state sockets in the netstat file.

          Patrick Hunt added a comment -

          Travis, I noticed an issue which I commented on in ZOOKEEPER-517. It's unfortunate that the logs rolled; if this happens again, perhaps try looking for this in the logs (from 517):

          LOG.warn("Too many connections from " + ia + " - max is " + maxClientCnxns );

          If a client starts DOSing the server we'll close the connections; however, the linger default is 60 seconds, so in some cases the client could still cause the server to run out of fds. (Something we should fix.)
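
          For reference, that per-source-IP connection cap is set in zoo.cfg; the value below is only an example, not a recommendation:

          # zoo.cfg
          maxClientCnxns=60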

          Patrick Hunt added a comment -

          Linking this issue to ZOOKEEPER-517 - may be related.

          Patrick Hunt made changes -
          Link This issue is related to ZOOKEEPER-517 [ ZOOKEEPER-517 ]
          Travis Crawford added a comment -

          I believe clients were reconnecting so frequently because their
          connections timed out.

          Mahadev konar added a comment -

          The CLOSE_WAIT sockets would probably be caused by the clients connecting to servers and then disconnecting on their own; the server then takes some time closing those TCP connections. Any reason the clients would shuttle around the servers so frequently?

          Travis Crawford made changes -
          Attachment zookeeper-out-of-files.netstat [ 12448505 ]
          Travis Crawford added a comment -

          The attached netstat output shows a large number of sockets in SYN_RECV and CLOSE_WAIT states. The fast connection churn appears to have consumed all available file handles.

          Travis Crawford made changes -
          Field Original Value New Value
          Attachment client-errors.txt [ 12448503 ]
          Travis Crawford added a comment -

          client-errors.txt contains the zookeeper error messages during the outage. The interesting part is how frequently they occur. This suggests having clients back off during issues would be an improvement, as this many connections this quickly makes things worse.

          Of course, there's a tradeoff between reconnecting quickly and backing off. For this reason, perhaps exponential backoff should be the default behavior, while clients that require fast reconnections could optionally use the current mode.

          Travis Crawford created issue -

            People

            • Assignee: Travis Crawford
            • Reporter: Travis Crawford
            • Votes: 0
            • Watchers: 3
