ZOOKEEPER-856

Connection imbalance leads to overloaded ZK instances

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.5.0
    • Component/s: None
    • Labels: None

      Description

      We've experienced a number of issues lately where "ruok" requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. A sluggish instance requires a restart to become responsive again.

      I believe the issue is connections are very imbalanced, leading to certain instances having many thousands of connections, while other instances are largely idle.

      A potential solution is to periodically disconnect/reconnect clients so that connections balance out over time; this seems safe because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected.
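      As a rough sketch of the mechanism (not a proposed patch; the ensemble string, timeout, and wrapper class are hypothetical, though the session-resuming ZooKeeper constructor is part of the public Java API), a periodic task could hand the session off to a fresh handle:

      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooKeeper;

      // Hypothetical sketch of client-side rebalancing; a real implementation
      // would live inside the client library or be enforced server-side.
      public class ReconnectingClient implements Watcher {
          private static final String ENSEMBLE = "zk1:2181,zk2:2181,zk3:2181"; // assumption
          private static final int SESSION_TIMEOUT_MS = 30000;                 // assumption

          private volatile ZooKeeper zk;

          public ReconnectingClient() throws Exception {
              zk = new ZooKeeper(ENSEMBLE, SESSION_TIMEOUT_MS, this);
          }

          // Resume the same session on a new handle, which picks a server at
          // random; ephemeral nodes and watches survive because the session
          // id/password are reused. Caveat: calling close() on the old handle
          // would end the shared session, so application code can only abandon
          // it (leaking its threads) - one reason to build this into the
          // client library or server instead.
          public void rebalance() throws Exception {
              ZooKeeper old = zk;
              zk = new ZooKeeper(ENSEMBLE, SESSION_TIMEOUT_MS, this,
                                 old.getSessionId(), old.getSessionPasswd());
          }

          @Override
          public void process(WatchedEvent event) {
              // Session events (Disconnected/SyncConnected) arrive here.
          }
      }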

        Activity

        Patrick Hunt added a comment -

        Btw, if it's possible for you to connect VisualVM or jconsole to the JVMs in question, you would also see some useful information (not always possible, though, especially in a production environment).

        Patrick Hunt added a comment -

        Agree, this has been an issue we've discussed before (connection balancing) and one we should address. The most pronounced case is when you restart a server: all the connections shift to the other servers and the restarted server has 0 connections...

        Travis Crawford added a comment -

        Thanks for the suggestions! I'm trying out the following options:

        -Xmx4000M
        -Xms4000M
        -Xloggc:/var/log/zookeeper/gc_20100827_201441.log
        -XX:+PrintGC
        -XX:+PrintGCDetails
        -XX:+PrintGCTimeStamps
        -XX:+PrintGCApplicationStoppedTime
        -XX:+PrintGCApplicationConcurrentTime
        -XX:+UseConcMarkSweepGC
        -XX:+CMSIncrementalMode
        -XX:+CMSIncrementalPacing

        I still think the connection imbalance issue is worth addressing, though. Even with properly tuned ZK servers, the longer-running instances accumulate connections to the point of becoming overloaded.
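        For reference, a sketch of where flags like these can live (paths illustrative): zkServer.sh sources conf/java.env via zkEnv.sh and appends $JVMFLAGS to the java command line.

        # conf/java.env (sourced by zkServer.sh at startup)
        export JVMFLAGS="-Xmx4000M -Xms4000M \
          -Xloggc:/var/log/zookeeper/gc.log \
          -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
          -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime \
          -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing"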

        Patrick Hunt added a comment -

        Well, it could be that increased connection count correlates with increased activity, including longer GC pauses as a result.

        I believe incremental mode is what you're looking for (just one reference):
        http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html#0.0.0.0.Incremental%20mode%7Coutline

        "The incremental mode is meant to lessen the impact of long concurrent phases by periodically stopping the concurrent phase to yield back the processor to the application."

        I've found this tool to be an excellent way to visualize GC activity: https://gchisto.dev.java.net/

        Also, have you verified that you are not using a large % of the heap? Perhaps you could include some JVM stats (heap utilization, say) in your monitoring framework.
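        A minimal sketch of pulling those heap numbers from inside a JVM with the standard MemoryMXBean (the class name is illustrative; wiring the output into a monitoring framework is left out):

        import java.lang.management.ManagementFactory;
        import java.lang.management.MemoryUsage;

        public class HeapStats {
            public static void main(String[] args) {
                // Same heap snapshot jconsole/VisualVM would show for the process.
                MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
                System.out.printf("heap: %d/%d MB (%.0f%% used)%n",
                        heap.getUsed() >> 20, heap.getMax() >> 20,
                        100.0 * heap.getUsed() / heap.getMax());
            }
        }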

        Travis Crawford added a comment -

        @patrick - We're using these settings, which I believe are based on what's recommended in the troubleshooting guide.

        -XX:+PrintGC
        -XX:+PrintGCDetails
        -XX:+PrintGCTimeStamps
        -XX:+PrintGCApplicationStoppedTime
        -XX:+PrintGCApplicationConcurrentTime
        -XX:+UseConcMarkSweepGC

        Looking at the logs I do see lots of GC activity. For example:

        Total time for which application threads were stopped: 0.5599050 seconds
        Application time: 0.0056590 seconds

        I only see this on the hosts that became unresponsive after acquiring lots of connections.

        Any suggestions for the GC flags? If there's something better I can experiment with, and update the wiki if we discover something interesting.

        http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

        Patrick Hunt added a comment -

        Have you monitored the JVMs for GC activity? Are you using CMS/incremental GC rather than the default GC setup? I'm all for adding balancing, but it would be good to rule GC/swap/IO out as an issue.

        Travis Crawford added a comment -

        @mahadev - I would love to help test a patch. I'm currently using 3.3.1 + ZOOKEEPER-744 + ZOOKEEPER-790, applied in that order.

        If there's a knob for how frequently to disconnect/reconnect I can try out different settings to see what a sensible default would be.

        Do you think this should be a client or server setting? I'm thinking a server setting, because otherwise it's not possible to enforce the policy.
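        To make the server-side option concrete, here is a self-contained model (the class, the ClientConn stand-in, and the max-connections policy are hypothetical, not ZooKeeper internals) of a periodic task that sheds connections above a target; dropped clients reconnect to another server within the session timeout, so sessions survive:

        import java.util.Deque;
        import java.util.concurrent.ConcurrentLinkedDeque;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        // Hypothetical model of a server-side balancer; not actual ZK code.
        public class ConnectionBalancer {
            interface ClientConn { void disconnect(); } // stand-in for a server connection

            private final Deque<ClientConn> conns = new ConcurrentLinkedDeque<>();
            private final ScheduledExecutorService timer =
                    Executors.newSingleThreadScheduledExecutor();

            // maxConns would come from the proposed config knob, e.g. expected
            // total clients divided by ensemble size, plus headroom.
            public ConnectionBalancer(int maxConns, long intervalSec) {
                timer.scheduleAtFixedRate(() -> {
                    // Drop the oldest connections above the target; clients
                    // re-attach elsewhere and keep ephemerals/watches.
                    while (conns.size() > maxConns) {
                        ClientConn victim = conns.pollFirst();
                        if (victim != null) victim.disconnect();
                    }
                }, intervalSec, intervalSec, TimeUnit.SECONDS);
            }

            public void register(ClientConn c) { conns.addLast(c); }
        }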

        Mahadev konar added a comment -

        Travis,
        We have had a lot of discussion about load balancing. I'd really like to try the disconnect/reconnect approach and see how it works for balancing. I'm also with you that it might be a good enough solution. I can upload a simple patch for this; would you have some bandwidth to try it out and report how well it works?

        Travis Crawford added a comment -

        Attached are two graphs showing:

        • Total ZooKeeper connections to a 3 node cluster
        • Connections per member in the cluster

        In the totals graph, notice how it's largely unchanged over time. This represents a steady-state period of usage.

        In the members graph, notice how the number of connections differs significantly between machines. This cluster allows the leader to service reads, so that's not something to factor in when interpreting the number of connections.

        These graphs look very similar to an issue I had with another service (scribe), which we solved by disconnecting every N±K messages. We tried getting fancy by publishing load metrics and using a smart selection algorithm, but in practice the periodic disconnect/reconnect was easier to implement and worked better, so I'm tossing that idea out as a potential solution here.
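        The N±K policy itself is easy to sketch (N, K, and the class are illustrative): each client draws a jittered message budget so reconnects don't synchronize across clients.

        import java.util.concurrent.ThreadLocalRandom;

        // Illustrative N±K disconnect counter, per the scribe approach above.
        public class JitteredReconnect {
            private final int n, k;
            private long remaining;

            public JitteredReconnect(int n, int k) {
                this.n = n;
                this.k = k;
                this.remaining = nextBudget();
            }

            // Uniform draw in [n - k, n + k] so clients reconnect at staggered times.
            private long nextBudget() {
                return n - k + ThreadLocalRandom.current().nextLong(2L * k + 1);
            }

            // Call once per message; true means it's time to cycle the connection.
            public boolean onMessage() {
                if (--remaining > 0) return false;
                remaining = nextBudget();
                return true;
            }
        }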


          People

          • Assignee: Mahadev konar
          • Reporter: Travis Crawford
          • Votes: 0
          • Watchers: 4
