ZooKeeper / ZOOKEEPER-885

Zookeeper drops connections under moderate IO load

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2, 3.3.1
    • Fix Version/s: 3.5.0
    • Component/s: server
    • Labels: None
    • Environment:

      Debian (Lenny)
      1Gb RAM
      swap disabled
      100Mb heap for zookeeper

      Description

      A ZooKeeper server under minimal load, with a number of clients watching exactly one node, will fail to maintain client connections when the machine is subjected to moderate IO load.

      In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command:

      dd if=/dev/urandom of=/dev/mapper/nimbula-test
      

      The dd command transferred data at a rate of about 4Mb/s.

      The same thing happens with

      dd if=/dev/zero of=/dev/mapper/nimbula-test
      

      It seems strange that such a moderate load should cause instability in the connection.

      Very few other processes were running, the machines were setup to test the connection instability we have experienced. Clients performed no other read or mutation operations.
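
      A minimal client along these lines might look like the sketch below. This is only an illustration (the actual test program is the attached WatcherTest.java); the connect string, znode path and 10-second session timeout are placeholder values:

      import java.util.concurrent.CountDownLatch;
      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooKeeper;

      public class WatchOneNode {
          public static void main(String[] args) throws Exception {
              final CountDownLatch disconnected = new CountDownLatch(1);
              ZooKeeper zk = new ZooKeeper("server1:2181,server2:2181,server3:2181", 10000,
                      new Watcher() {
                          public void process(WatchedEvent event) {
                              // Log every connection state change; a Disconnected event counts as a failure.
                              System.out.println(event);
                              if (event.getState() == Event.KeeperState.Disconnected) {
                                  disconnected.countDown();
                              }
                          }
                      });
              zk.exists("/watched-node", true);  // watch exactly one node, then sit idle
              disconnected.await();              // block until the first disconnect
              zk.close();
          }
      }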

      Although the documentation states that minimal competing IO load should be present on the ZooKeeper server, it seems reasonable that moderate IO should not cause problems in this case.

      1. benchmark.csv
        24 kB
        Alexandre Hardy
      2. tracezklogs.tar.gz
        98 kB
        Alexandre Hardy
      3. tracezklogs.tar.gz
        18 kB
        Alexandre Hardy
      4. zklogs.tar.gz
        64 kB
        Alexandre Hardy
      5. WatcherTest.java
        2 kB
        Alexandre Hardy

        Activity

        Alexandre Hardy added a comment -

        I have attached the client used to test the zookeeper servers.

        Patrick Hunt added a comment -

        I suspect this is being caused by ZOOKEEPER-568 which was fixed in 3.3 releases and later. (you are using 3.2.2 where it's not fixed)

        ZK provides very strong durability guarantees, and needs to wait for fsync to complete before it responds to a write request with "success". I've seen the fsync take significant amounts of time depending on IO load, kernel, filesystem, etc.

        Even though your client isn't doing work, pings are still sent in the background. These should not cause a forced disk sync, but they are forcing a sync due to this bug (ZOOKEEPER-568). As a result we need to wait for the disk sync to complete before replying to the ping, and since you are loading the disk this takes time. If it takes longer than 3.3 seconds (1/3 of the timeout) the client will time out. (10 sec will cause expiration)

        Can you retry this using zookeeper 3.3.1?

        BTW, You can monitor server latency using the following: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
        specifically the "stat" command. Check the server avg/max latency before and during your "dd ..." test.
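
        For example, the four-letter-word commands can be polled over a plain TCP connection to the client port; a rough sketch (host and port are placeholders, point it at each server in turn):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.net.Socket;

        public class StatProbe {
            public static void main(String[] args) throws Exception {
                Socket s = new Socket("localhost", 2181);
                try {
                    // Send the "stat" command; the server replies and then closes the connection.
                    s.getOutputStream().write("stat".getBytes("US-ASCII"));
                    s.getOutputStream().flush();
                    BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                    String line;
                    while ((line = in.readLine()) != null) {
                        System.out.println(line);  // includes a "Latency min/avg/max: ..." line
                    }
                } finally {
                    s.close();
                }
            }
        }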

        Alexandre Hardy added a comment -

        Thanks Patrick.

        I have now performed the same test with zookeeper-3.3.1 as distributed in the Cloudera CDH3 distribution. My results are fairly similar (I don't have hard numbers for comparison though). I still experience disconnects and session timeouts even though dd transfers at only 5Mb per second.

        I have had improvements in stability with this test by adjusting the following:

        echo 5 > /proc/sys/vm/dirty_ratio
        echo 5 > /proc/sys/vm/dirty_background_ratio
        

        These settings are sufficient to get stability on this simple test.

        I'll work on getting some more concrete comparisons of zookeeper-3.2.2 and zookeeper-3.3.1.

        The output of the "stat" command seems reasonable (0/10/91) during dd. It seems to me that the zookeeper server runs fairly well for a period of time, and then hits a period of instability, and then recovers again. Are pings also counted for server latency?

        I suspect that IO scheduling / buffering may be partly responsible for the problem, and hence started testing the effect of dirty_ratio etc. At this stage I am presuming that dd manages to write a fair amount of data before the data is actually flushed to disk, and that when it is flushed all processes attempting to write to the device are stalled until the flush completes.

        What is unclear to me at this point, is what gets written to disk? What would zookeeper be writing to disk if none of the clients are submitting any requests? Something to do with the session?

        Patrick Hunt added a comment -

        A question on the user list this AM triggered a brain cell: have you tuned "maxClientCnxns" in your server configuration?

        See this configuration param in the docs "maxClientCnxns":
        http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_advancedConfiguration

        If all your clients are from a single IP, and if this server config is not changed from the default (10), then you will only be able to maintain 30 sessions, with a max of 10 sessions from a single IP per server.

        You can check for something like the following in your server log files:

        Too many connections from /########## - max is ######

        (### replaced with ipaddr and max setting respectively)

        Also, your comments included the following for latency from the server "(0/10/91)", which means that the max session response time to the client you are seeing is 91 milliseconds. This rules out the server being responsible for causing the clients to expire. If the max was above your timeout (10sec) then it might be possible, but the server is having no problems replying to the client in this case, even under the load you are applying.

        Can you attach one/more of your server logs which display the problem (ie during the time the clients are getting expired). That will help to track this down.

        Alexandre Hardy added a comment -

        maxClientCnxns is set to 30, so 45 clients spread across 3 servers should not be unreasonable, and I do have confirmation that a session is established for every client (all 45 of them) before beginning the disk load with dd.

        I'm aiming for 0 disconnects with this simple example.

        Alexandre Hardy added a comment -

        Attached are logs from the two sessions with disconnects. I have not filtered the logs in any way. The logs for 3.3.1 are the most clear, and have exactly one failure roughly 3 minutes after the logs start. Unfortunately the logs don't offer much information (as far as I can make out). Should I enable more verbose logging?

        Patrick Hunt added a comment - edited

        I do have confirmation that a session is established for every client (all 45 of them) before beginning the disk load with dd.

        I see, I was just trying to reduce variables. That should be fine then.

        I see this in the logs:
        2010-10-07 14:49:13,956 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server environment:java.version=1.6.0_0
        2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server environment:java.vendor=Sun Microsystems Inc.
        2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server environment:java.home=/usr/lib/jvm/java-6-openjdk/jre

        I'm not sure many users are running OpenJDK; also, 1.6.0_0 is very old (I have 1.6.0_18 OpenJDK on my system). You should upgrade to a recent version of OpenJDK at the least, although I'd highly suggest running with the official (and recent) Sun JDK. (Again, this is to reduce variables.)

        Also, I noticed this in the server log for one server; it seems to be misconfigured. Perhaps you can fix that? (normal_3.3.1/192.168.131.12.log)

        2010-10-07 14:49:13,979 - FATAL [main:QuorumPeerMain@83] - Invalid config, exiting abnormally

        Should I enable more verbose logging?

        Yes, give that a try, perhaps run with TRACE logging turned on. If you can upload one of those logs I'll take a look.

        Right now we have this in the server log:

        2010-10-07 14:51:32,961 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable to read additional data from client sessionid 0x22b872ad9ff000c, likely client has closed socket
        2010-10-07 14:51:32,962 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for client /10.23.4.95:59738 which had sessionid 0x22b872ad9ff000c

        This indicates that the client is closing the connection (EOS).

        Please capture the logs on your client and upload one of them. Perhaps run that at TRACE level as well. That will give us more insight into why the client is closing its side of the connection (at least from the server's perspective).

        Thanks for the help on this!

        Alexandre Hardy added a comment -

        I accidentally missed the configuration for one of the nodes (switching from 3.2.2 to 3.3.1). Thanks for spotting that.

        Here are updated log files. I have updated the test program to terminate as soon as any single connection is dropped (Since my goal is to have zero connection failures for this simple test).

        The client and server clocks are a bit out of sync (about 4 minutes, with the servers logging in UTC and client logging in UTC+2), but the test has been set up so that only one "instability event" is recorded.

        Alexandre Hardy added a comment -

        I failed to set up the debug logs correctly in the previous attachment. This attachment has full trace logs.

        Patrick Hunt added a comment -

        Flavio, can you take a look at the updated logs? Thanks.

        Patrick Hunt added a comment -

        I notice this in the server log (continuous log):

        2010-10-08 08:46:24,789 - DEBUG [ProcessThread:-1:CommitProcessor@169] - Processing request:: sessionid:0x32b8b03e21e000c type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
        2010-10-08 08:46:24,789 - DEBUG [CommitProcessor:3:FinalRequestProcessor@78] - Processing request:: sessionid:0x32b8b03e21e000c type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
        2010-10-08 08:46:24,789 - DEBUG [CommitProcessor:3:FinalRequestProcessor@160] - sessionid:0x32b8b03e21e000c type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
        2010-10-08 08:46:27,986 - DEBUG [ProcessThread:-1:CommitProcessor@169] - Processing request:: sessionid:0x32b8b03e21e0001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
        2010-10-08 08:46:38,000 - INFO [SessionTracker:ZooKeeperServer@315] - Expiring session 0x32b8b03e21e0004, timeout of 10000ms exceeded
        2010-10-08 08:46:46,471 - INFO [SessionTracker:ZooKeeperServer@315] - Expiring session 0x32b8b03e21e0003, timeout of 10000ms exceeded
        2010-10-08 08:47:00,083 - INFO [SessionTracker:ZooKeeperServer@315] - Expiring session 0x32b8b03e21e0001, timeout of 10000ms exceeded

        It looks to me like the pipeline is getting stalled after 0x32b8b03e21e000c. Notice that 0x32b8b03e21e0001 sends in a ping which gets to the commit processor, but you never see the "final request processor" messages (before the session expires). Notice that 30 seconds elapse here.

        Flavio Junqueira added a comment -

        I remember a while back fixing an issue with CommitProcessor, which was being killed by a runtime exception. As Pat pointed out, it does look like the pipeline is stalling, but it is still unclear why and I couldn't find anything that can indicate the cause.

        Let me try to reproduce it.

        Flavio Junqueira added a comment -

        Hi Alexandre, when you load the machines running the zookeeper servers by running the dd command, how much time elapses between running dd and observing the connections expiring? I'm not able to reproduce it, and I wonder how long the problem takes to manifest.

        Patrick Hunt added a comment - edited

        Alexandre, Can you provide some additional detail:

        1) you are applying the load using dd to all three servers at the same time, is that correct? (not just to 1 server)

        2) /dev/mapper indicates some sort of lvm setup, can you give more detail on that? (fyi http://ubuntuforums.org/showthread.php?t=646340)

        3) you mentioned that this:

        echo 5 > /proc/sys/vm/dirty_ratio
        echo 5 > /proc/sys/vm/dirty_background_ratio

        resulted in "stability in this test"; can you tell us what these were set to initially?

        Checkout this article: http://lwn.net/Articles/216853/

        I notice you are running a "bigmem" kernel. What's the total memory size? How large a heap have you assigned to the ZK server (JVM)?

        4) Can you verify whether or not the JVM is swapping? Any chance that the server JVM is swapping, which is causing the server to pause, which then causes the clients to time out? This seems to me like it would fit the scenario - esp given that when you turn the "dirty_ratio" down you see stability increase (the time it would take to complete the flush would decrease, meaning that the server can respond before the client times out).

        Alexandre Hardy added a comment -

        Hi Alexandre, when you load the machines running the zookeeper servers by running the dd command, how much time elapses between running dd and observing the connections expiring? I'm not able to reproduce it, and I wonder how long the problem takes to manifest.

        Hi Flavio,

        Problems usually start occurring after about 30 seconds. I have also tested on some other machines and response varies somewhat. I suspect that the 30 seconds is dictated by some queue that needs to fill up before significant disk traffic is initiated. I think that the speed at which this queue is processed determines how likely it is that zookeeper will fail to respond to a ping.

        Please also see responses to Patrick's questions.

        Alexandre Hardy added a comment -

        Hi Patrick,

        1) you are applying the load using dd to all three servers at the same time, is that correct? (not just to 1 server)

        Correct. If dd is run on only one machine then the likelihood of disconnects is reduced. Unfortunately our typical scenario would involve load on all three machines.

        2) /dev/mapper indicates some sort of lvm setup, can you give more detail on that? (fyi http://ubuntuforums.org/showthread.php?t=646340)

        Yes, we have an lvm setup on a single spindle. The nimbula-test logical volume is 10G in size and (obviously) shares the same spindle as root and log (/var/log) partitions.

        3) you mentioned that this:

        echo 5 > /proc/sys/vm/dirty_ratio
        echo 5 > /proc/sys/vm/dirty_background_ratio

        resulting in "stability in this test", can you tell us what this was set to initially?

        Checkout this article: http://lwn.net/Articles/216853/

        The initial value for /proc/sys/vm/dirty_ratio is 20, and the initial value for /proc/sys/vm/dirty_background_ratio is 10. These machines have 1G of RAM, and thus are less susceptible to the problems mentioned in http://lwn.net/Articles/216853/ (as I see it). I have run a more complete benchmark with random IO instead of sequential dd IO, testing session timeouts and the effect of the dirty_ratio settings. I will attach that separately. dirty_ratio seems to help with the dd test but has much less influence in the random IO test.

        I notice you are running a "bigmem" kernel. What's the total memory size? How large of a heap have to assigned to the ZK server? (jvm)

        We have 1G on each machine in this test system and 100M heap size for each zookeeper server.

        4) Can you verify whether or not the JVM is swapping? Any chance that the server JVM is swapping, which is causing the server to pause, which then causes the clients to time out? This seems to me like it would fit the scenario - esp given that when you turn the "dirty_ratio" down you see stability increase (the time it would take to complete the flush would decrease, meaning that the server can respond before the client times out).

        I'm not entirely sure of all the JVM internals, but all swap space on the linux system was disabled. So no swapping based on the linux kernel would happen. I'm not sure if the JVM does any swapping of its own?
        I concur with your analysis. What puzzles me is why the system would even get into a state where the zookeeper server would have to wait so long for a disk flush? In the case of dd if=/dev/urandom the IO rate is quite low, and there should (I think) be more than enough IOPS available for zookeeper to flush data to disk in time. Even if the IO scheduling results in this scenario, it is still not clear to me why zookeeper would fail to respond to a ping. My only conclusion at this stage is that responding to a ping requires information to be flushed to disk. Is this correct?

        Referring to your private e-mail:

        > The weird thing here is that there should be no delay for these pings.

        This would indicate to me that the ping response should not be dependent on any disk IO.

        Thanks for all the effort in looking into this!

        Alexandre Hardy added a comment -

        Here are some results with different settings of dirty_ratio, dirty_bytes (finer control), session timeouts, and io priorities (set with ionice).

        The status field indicates success or failure: 0 means success and anything else means failure, where failure means a ZooKeeper session disconnected (but did not necessarily expire).

        Each test ran for a maximum of 5 minutes. A test could also fail if it failed to connect to the zookeeper servers within the first 60 seconds; unfortunately, the return code did not differentiate properly between these cases. However, pauses of about 4 seconds were allowed between tests, during which all IO operations (by the test program) were stopped. This should have allowed the system to stabilize somewhat.

        In this test the session timeout had the most significant influence on stability (not too surprising) and the other system settings had far less influence.

        We have settled on a 60s session timeout for the moment (to achieve the stability we need under the IO loads we are experiencing). But it would be great if we could reduce this a bit.
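
        For reference, the session timeout is the second argument to the ZooKeeper client constructor, so the change amounts to something like the sketch below (connect string and znode path are placeholders). Note that the server negotiates the final timeout within its minSessionTimeout/maxSessionTimeout bounds (2x-20x tickTime by default), so the server config may also need to allow 60s:

        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooKeeper;

        public class LongSessionClient {
            public static void main(String[] args) throws Exception {
                // Request a 60000 ms session timeout.
                ZooKeeper zk = new ZooKeeper("server1:2181,server2:2181,server3:2181", 60000,
                        new Watcher() {
                            public void process(WatchedEvent event) {
                                switch (event.getState()) {
                                case Disconnected:
                                    // Transient: the client library keeps trying to reconnect
                                    // within the same session.
                                    System.out.println("disconnected");
                                    break;
                                case Expired:
                                    // Fatal for this handle: a new ZooKeeper instance is needed.
                                    System.out.println("session expired");
                                    break;
                                default:
                                    System.out.println(event);
                                }
                            }
                        });
                zk.exists("/watched-node", true);
                Thread.sleep(Long.MAX_VALUE);
            }
        }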

        Benjamin Reed added a comment -

        We are having problems reproducing this. Can you give a bit more detail on the machines you are using? What are the CPU and memory sizes? Also, what is the throughput of dd if=/dev/zero of=/dev/mapper/nimbula-test? Is there just one disk, where nimbula-test is a partition on that disk and you have another partition for the snapshots and logs?

        Even if you don't have swap space, code pages can be discarded and loaded on demand, so that could be a potential problem. What does /proc/meminfo look like?

        Dave Wright added a comment -

        I don't think the cause of this is much of a mystery, as we experienced similar problems when we had the zookeeper files on the same filesystem as an IO-heavy application that was doing buffered IO. Quite simply, when zookeeper does a sync on its own files, it causes the entire filesystem to sync, flushing any buffered data from the IO-heavy application and freezing the ZK server process for long enough for heartbeats to time out.

        When you say "moderate IO load" I'm curious what the bottleneck is - the dd command will copy data as fast as possible, if you're only getting 4MB/sec, the underlying device must be pretty slow, which would further indicate why a sync() request would take a while to complete.

        The only fix we've seen is to put the ZK files on their own device, although you may be able to fix it with a different partition on the same device.
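
        Concretely, that separation is controlled by the dataDir and dataLogDir settings in zoo.cfg; a sketch with illustrative mount points:

        # zoo.cfg (paths are illustrative)
        # Keep the transaction log on a device nothing else writes to, so
        # ZooKeeper's fsyncs don't queue behind other processes' dirty data.
        dataDir=/var/lib/zookeeper/data
        dataLogDir=/mnt/zk-txnlog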

        Flavio Junqueira added a comment -

        I'm not sure it is that simple, Dave. The problem is that pings do not require writes to disk, and in the scenario that Alexandre describes, there are only pings being processed. Why is the background I/O load affecting the processing of ZooKeeper? And in particular, why are sessions expiring as a consequence of this background I/O load?

        Patrick Hunt added a comment -

        re what Dave just mentioned (fsync issues on ext3 volume), this is a good read:
        http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/
        ZK is in a similar situation.

        The perplexing thing here though is that the client is just doing pings, so really there should be no writes/syncs to disk for ZK. That's what I'm stuck on.

        Dave Wright added a comment -

        Has it been verified that ZK is doing no disk activity at all during that time? What about log file writes? What about sessions being established/torn down (which would cause syncs)?

        Flavio Junqueira added a comment -

        I've been running it and there is no traffic to the disk while the clients are watching. We generate a snapshot every snapCount transactions, but given that there are no transactions generated, no transaction is appended to the log and no new snapshot is written.

        Alexandre Hardy added a comment -

        Hi Flavio,

        I've set up some EC2 instances to reproduce the problem. I think the problem is related to relative disk performance and load.

        I have had to use a more aggressive disk benchmark utility to get the problem to occur, and I realise that this is contrary to the zookeeper requirements. However, I think we would like to know why a ping would be affected when no disk access is expected.

        Can we discuss access to these instances via e-mail?

        Kind regards,
        Alexandre

        Flavio Junqueira added a comment -

        Sure, let's discuss over e-mail and we can post here later our findings.

        Vishal Kher added a comment -

        I am curious to know what the problem was. Any updates? Thanks.

        Alexandre Hardy added a comment -

        Hi Vishal,

        No progress as yet. I think everyone has been a bit overloaded during the last few weeks.

        Michi Mutsuzaki added a comment -

        > I have had to use a more aggressive disk benchmark utility to get the
        > problem to occur, and I realise that this is contrary to the zookeeper
        > requirements. However, I think we would like to know why a ping would
        > be affected when no disk access is expected.

        Which benchmark utility did you use? I'm trying to reproduce this on my cluster.

        Thanks!
        --Michi

        Michi Mutsuzaki (Inactive) added a comment -

        I promised to look into this, but I haven't had time to do it. I'll try to spend some time on this next week.

        --Michi

        Mahadev konar added a comment -

        Not a blocker. Moving it out of the 3.4 release.

        Matt Wise added a comment -

        This issue still exists, right? I just ran into this today where several clients were disconnected during a short IO blockage on one of our ZK nodes. Will this be fixed by 3.5.0 please?

        Michi Mutsuzaki added a comment -

        I recently tested this with 3.4.5. On my setup, with a 6 second session timeout, the client doesn't get disconnected even at 100% disk utilization (as reported by iostat) when there is no write traffic to ZooKeeper. When there is write traffic, the connection starts becoming flaky around 80~90% disk utilization.

        We could make the disconnect timeout (2/3 of the session timeout) configurable. The client can keep sending pings to avoid session expiry even if the ping responses get blocked behind a slow write, because the session timestamp gets updated as soon as the server receives the ping request. You could also have an additional mechanism for detecting server failures using a four-letter-word command, although this would put additional load on the server.

        I don't have bandwidth to do this for 3.5.0 though. Let me know if anybody is interested in working on this. Otherwise, I'll push this to 3.6.0.


          People

          • Assignee: Unassigned
          • Reporter: Alexandre Hardy
          • Votes: 0
          • Watchers: 7
