ZOOKEEPER-1177

Enabling a large number of watches for a large number of clients

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.3
    • Fix Version/s: 3.5.1
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Changes to the watch manager to support very large numbers of watches (200 million). This change also improves the synchronization in the WatchManager to reduce contention on various watch manager operations (mainly addWatch(), which is a fairly common operation after a watch is triggered).

      Description

      In my ZooKeeper deployment, I see the watch manager consuming several GB of memory, so I dug a bit deeper.

      In the scenario I am testing, I have 10K clients connected to an observer. There are about 20K znodes in ZooKeeper, each about 1 KB, so about 20 MB of data in total.
      Each client fetches and puts watches on all the znodes. That is 200 million watches.

      It seems a single watch takes about 100 bytes. I am currently at 14,528,037 watches and, according to the YourKit profiler, WatchManager is already using 1.2 GB. This is not going to work, as it might end up needing 20 GB of RAM just for the watches.

      So we need a more compact way of storing watches. Here are the possible solutions.
      1. Use a bitmap instead of the current hashmap. In this approach, each znode gets a unique id when it is created. For every session, we keep a bitmap indicating the set of znodes that session is watching (see the sketch after this list). A bitmap, assuming 100K znodes, would be about 12 KB; for 10K sessions, we could track the watches in roughly 120 MB instead of 20 GB.
      2. This second idea is based on the observation that clients watch znodes in sets (for example, all znodes under a folder). Multiple clients watch the same set, and the total number of sets is a couple of orders of magnitude smaller than the total number of znodes. In my scenario, there are about 100 sets. So instead of keeping track of watches at the znode level, keep track of them at the set level. This may mean that gets also need to be implemented at the set level. With this, we could store the watches in about 100 MB.
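
      For illustration, a minimal Java sketch of how solution 1 could look, assuming znodes are assigned dense integer ids; the class and field names are hypothetical, not ZooKeeper's actual WatchManager API.

      import java.util.BitSet;
      import java.util.HashMap;
      import java.util.Map;

      // Sketch of solution 1: one BitSet of znode ids per session,
      // instead of per-watch map entries (~1 bit vs ~100 bytes per watch).
      class SessionBitmapSketch {
          private final Map<String, Integer> pathToId = new HashMap<String, Integer>();
          private final Map<Long, BitSet> watchesBySession = new HashMap<Long, BitSet>();
          private int nextId = 0;

          synchronized void addWatch(long sessionId, String path) {
              Integer id = pathToId.get(path);
              if (id == null) {
                  id = nextId++;                 // unique id assigned when the znode is first seen
                  pathToId.put(path, id);
              }
              BitSet bits = watchesBySession.get(sessionId);
              if (bits == null) {
                  bits = new BitSet();
                  watchesBySession.put(sessionId, bits);
              }
              bits.set(id);
          }

          synchronized boolean isWatching(long sessionId, String path) {
              BitSet bits = watchesBySession.get(sessionId);
              Integer id = pathToId.get(path);
              return bits != null && id != null && bits.get(id);
          }
      }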

      Are there any other suggestions of solutions?

      Thanks

      1. ZOOKEEPER-1177.patch
        29 kB
        Patrick Hunt
      2. ZOOKEEPER-1177.patch
        20 kB
        Patrick Hunt
      3. ZooKeeper-with-fix-for-findbugs-warning.patch
        20 kB
        Vikas Mehta
      4. Zookeeper-after-resolving-merge-conflicts.patch
        20 kB
        Vikas Mehta
      5. ZooKeeper.patch
        20 kB
        Vikas Mehta

        Activity

        Patrick Hunt made changes -
        Fix Version/s 3.5.1 [ 12326786 ]
        Fix Version/s 3.5.0 [ 12316644 ]
        Thawan Kooburat made changes -
        Assignee Vishal Kathuria [ vishal.k ] Thawan Kooburat [ thawan ]
        Patrick Hunt made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Patrick Hunt added a comment -

        Cancelling the patch while the performance issues are being addressed.

        Patrick Hunt added a comment -

        Zhihong - I searched around a bit but didn't find that, I'll take a look (it's ASL licensed), thanks!

        Ted Yu added a comment -

        Patrick mentioned the lack of a SparseBitSet.
        I found this: http://blog.rapleaf.com/dev/2010/12/17/memory-efficient-sparse-bitsets/ which leads to https://github.com/lemire/javaewah

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12509448/ZOOKEEPER-1177.patch
        against trunk revision 1227000.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/876//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/876//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/876//console

        This message is automatically generated.

        Patrick Hunt made changes -
        Attachment ZOOKEEPER-1177.patch [ 12509448 ]
        Patrick Hunt added a comment -

        Vikas, here's the updated patch with my perf tests, the easiest way I've found to run is:

        ant -Dtest.junit.maxmem=2g -Dtest.output=yes -Dtestcase=WatchManagerPerf clean test-core-java

        You mentioned earlier seeing contention issues with multiple reader/writer threads. I think you should augment my *Perf test with some additional multi-threaded tests so that we can all see the issues/results. Thanks!

        Patrick Hunt added a comment -

        Vikas - can you incorporate Ben's idea as well - leave the old impl around, add yours as a new impl, with a new interface supported by both. That will simplify testing & perf verification. A system property that allows you to switch the impl would be great and could be used by the tests. I'll upload my perf test today.

        Vikas Mehta added a comment -

        That was a big miss on my part for the non-watched case. Thanks for the performance analysis and for catching this issue.

        I did attempt to keep the two-way hashmap in one of the implementations, and at the time the locking was still very coarse in WatchManager (especially having to synchronize on the path when adding a watch, which added a lot of contention since all the watchers set their watches on the path again after the watch was triggered).

        With the current locking approach, we can try the two-way hash map again, taking the lock on the pathID only when updating watches on the path (the pathID-to-watches map entry). I will try to have the new patch with two-way locks ready in the next few days.

        Patrick - can you send me your perf tests so that I can compare results?

        Thanks,
        Vikas

        Vishal Kathuria added a comment -

        Thanks for doing such a detailed performance analysis on this, Patrick. There are no other changes to this code that we haven't yet contributed.

        Re @Daniel's suggestion - as far as I recall, Vikas tried that, but we had to abandon it because of locking contention in our version of ZK that has multi-threaded reads. Vikas: if you have the details from that trial, could you please share them?

        Patrick Hunt added a comment -

        @daniel – I was thinking the same thing over the weekend. We'd probably want to refactor a bit while we're doing it.

        @ben sounds reasonable. Especially if we want to enable performance comparison. We should also look at the locking overhead that you highlighted earlier.

        I was also planning to contribute my perf tests. I'll work on this further unless someone else wants to?

        vishal - have you made any other changes to this code not yet contributed, or issues found?

        Benjamin Reed added a comment -

        perhaps we should make WatchManager an interface so that we can select alternate implementations at runtime. that way we could also put more experimental implementations in without risking default installations.
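
        For illustration only, a sketch of what such a pluggable WatchManager could look like, with the implementation chosen via a system property; the interface methods, property name, and class names are hypothetical, not the actual patch.

        import org.apache.zookeeper.Watcher;

        // Hypothetical sketch: a watch-manager interface plus a factory that picks
        // the implementation class from a system property, so tests and perf runs
        // can switch implementations without code changes.
        interface IWatchManager {
            void addWatch(String path, Watcher watcher);
            void removeWatcher(Watcher watcher);
            int size();
        }

        final class WatchManagerFactory {
            private WatchManagerFactory() {}

            static IWatchManager create() throws Exception {
                // e.g. -Dzookeeper.watchManagerImpl=com.example.BitSetWatchManager (hypothetical)
                String impl = System.getProperty("zookeeper.watchManagerImpl",
                        "com.example.DefaultWatchManager");      // hypothetical default
                return (IWatchManager) Class.forName(impl).newInstance();
            }
        }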

        Daniel Gómez Ferro added a comment -

        I think we could use the same trick to keep a reverse index, that is, mapping each Watcher to a unique integer id and having a BitSet per watched path. We would double the memory usage, but I think it's worth it.
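
        A rough sketch of the two-way index being proposed, assuming watchers also get dense integer ids; all names here are illustrative:

        import java.util.BitSet;
        import java.util.HashMap;
        import java.util.Map;

        // Illustrative sketch: forward index (watcher id -> watched path ids) plus
        // the reverse index (path id -> watcher ids), kept in sync on every add.
        class TwoWayBitSetIndex {
            private final Map<Integer, BitSet> pathsByWatcher = new HashMap<Integer, BitSet>();
            private final Map<Integer, BitSet> watchersByPath = new HashMap<Integer, BitSet>();

            synchronized void addWatch(int watcherId, int pathId) {
                bits(pathsByWatcher, watcherId).set(pathId);
                bits(watchersByPath, pathId).set(watcherId);   // the extra, reverse entry
            }

            // Triggering a path reads one BitSet instead of scanning every watcher.
            synchronized BitSet watchersOf(int pathId) {
                return (BitSet) bits(watchersByPath, pathId).clone();
            }

            private static BitSet bits(Map<Integer, BitSet> m, int key) {
                BitSet b = m.get(key);
                if (b == null) {
                    b = new BitSet();
                    m.put(key, b);
                }
                return b;
            }
        }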

        Patrick Hunt added a comment -

        Good point. The performance is terrible but not entirely for the reason you mentioned (although that's contributing).

        With 10k watchers and 10k un-watched paths, all of the paths being triggered, the NEW case runs in ~6 sec, while the OLD case runs in ~50 ms.

        The problem is the loop itself. I tried re-running the test with all of the locking turned off (including making the table a HashMap rather than a ConcurrentHashMap) and, while it does run faster, it still takes ~3 sec. I even turned off the entire "if (watchedPaths.get(pathId))" conditional/block and it runs about the same (~3 sec).

        Doesn't seem like we should commit this with such poor performance for the common case. Perhaps we need a hybrid approach? I don't see how that could easily be done though.

        Benjamin Reed added a comment -

        i'm curious about the effect this has on the overall performance of non-watched operations. if you have 10K watchers, then you are going to acquire and release 10K locks in removeWatches for every write operation, correct?

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12508907/ZOOKEEPER-1177.patch
        against trunk revision 1225352.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/870//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/870//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/870//console

        This message is automatically generated.

        Patrick Hunt made changes -
        Attachment ZOOKEEPER-1177.patch [ 12508907 ]
        Patrick Hunt added a comment -

        Added a couple minor tweaks to the patch (final fields and such).

        Patrick Hunt added a comment -

        In reviewing this code I noticed a couple concerns:

        1) it's unfortunate there is no SparseBitSet; in the worst-case scenario we might end up wasting memory due to the id assignment. Say you had 1k sessions and 1m znodes. If the first session set watches on all 1m znodes and the other 999 sessions set a watch on znode 1m (the one with the highest id), those 999 sessions would have empty bitsets, save for the very last bit (#1m). Not sure if this is a real concern in practice though.

        2) more concerning is removeWatcher - it removes the bitset, but not the associated entries in path2Id/id2Path. Granted, cleanup is an expensive operation in this case; however, without it we'll end up "leaking" path2Id/id2Path mappings in some cases.

        1 doesn't seem like a blocker given the benefits, and perhaps 2 is not a concern either, given that in most cases the set of paths is bounded and, if the znode is deleted, the watch manager will clean up at that time.

        Thoughts?
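
        One possible shape of the cleanup concern 2 refers to, sketched with hypothetical names (path2Id/id2Path mirror the wording above); the scan over all remaining bitsets is exactly why such cleanup is expensive:

        import java.util.BitSet;
        import java.util.Collection;
        import java.util.Iterator;
        import java.util.Map;

        // Hypothetical sketch for concern 2: after a watcher is removed, drop
        // path2Id/id2Path entries that no remaining watcher's bitset references.
        // Cost is O(paths x watchers), which is why it may be acceptable to skip.
        class PathIdCleanupSketch {
            static void cleanup(Map<String, Integer> path2Id,
                                Map<Integer, String> id2Path,
                                Collection<BitSet> remainingWatcherBitSets) {
                Iterator<Map.Entry<Integer, String>> it = id2Path.entrySet().iterator();
                while (it.hasNext()) {
                    Map.Entry<Integer, String> entry = it.next();
                    boolean stillWatched = false;
                    for (BitSet bits : remainingWatcherBitSets) {
                        if (bits.get(entry.getKey())) {
                            stillWatched = true;
                            break;
                        }
                    }
                    if (!stillWatched) {              // no watcher references this path id
                        path2Id.remove(entry.getValue());
                        it.remove();
                    }
                }
            }
        }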

        Patrick Hunt added a comment -

        I ran some ghetto performance numbers against this patch on trunk (NEW) vs without (OLD)

        I modified testSizeInBytes to create 10k watchers and 1k paths, each watcher is watching all the paths - 10m watches in total. (OLD failed with 10k/10k, even at 2g, while NEW ran fine with 512m)

        java version "1.6.0_26"
        Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
        Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode)
        
        ant -Dtest.junit.maxmem=2g -Dtest.output=yes -Dtestcase=WatchManagerTest clean test-core-java
        
        add - add 10m watches
        size - run size on the manager
        dump - dump the watches to /dev/null (bypath and byid)
        trigger - trigger the 10m watches
        
        the numbers settled down to something like this after letting the VM warm up:
        
        NEW
            [junit] 1753ms to add
            [junit] size:10000000
            [junit] 1ms to size
            [junit] 3424ms to dumpwatches true
            [junit] 3066ms to dumpwatches false
            [junit] 2318ms to trigger
        OLD
            [junit] 9736ms to add
            [junit] size:10000000
            [junit] 0ms to size
            [junit] 5615ms to dumpwatches true
            [junit] 3035ms to dumpwatches false
            [junit] 5530ms to trigger
        
        notice:
        add - ~5 times faster
        size - approx the same, even though NEW is scanning all bitsets
        dump - faster for bypath, about the same for byid
        trigger - ~2 times faster
        

        here are the numbers with 1k watchers and 10k paths

        NEW
            [junit] 1219ms to add
            [junit] size:10000000
            [junit] 0ms to size
            [junit] 3527ms to dumpwatches true
            [junit] 3680ms to dumpwatches false
            [junit] 1426ms to trigger
        OLD
            [junit] 7020ms to add
            [junit] size:10000000
            [junit] 1ms to size
            [junit] 3585ms to dumpwatches true
            [junit] 3251ms to dumpwatches false
            [junit] 2843ms to trigger
        
        both OLD and NEW do better in this case than in the 10k/1k case. NEW is still significantly ahead of OLD.
        
        
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12503927/ZooKeeper-with-fix-for-findbugs-warning.patch
        against trunk revision 1225200.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/866//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/866//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/866//console

        This message is automatically generated.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12503927/ZooKeeper-with-fix-for-findbugs-warning.patch
        against trunk revision 1202557.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/792//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/792//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/792//console

        This message is automatically generated.

        Vikas Mehta made changes -
        Vikas Mehta added a comment -

        Fixed the findbugs warning.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12503831/Zookeeper-after-resolving-merge-conflicts.patch
        against trunk revision 1202360.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/791//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/791//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/791//console

        This message is automatically generated.

        Vikas Mehta made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop Flags Reviewed [ 10343 ]
        Release Note Changes to the watch manager to support very large (200 million) watches. This change also improves the synchronization in the WatchManager to reduce the contention on various watch manager operations (mainly addWatch() which is a fairly common operation after trigger watch).
        Vikas Mehta made changes -
        Vikas Mehta added a comment -

        patch after incorporating review comments and resolving merge conflicts.

        Patrick Hunt added a comment -

        Hi Vikas, yes, you can submit at any time. That just marks the issue as something that should be reviewed by a committer, typically for commit to svn. Alternatively, you can indicate in your submit message that the patch is a draft and you'd like initial feedback, etc. The main reason for marking as submitted is to get the QA bot to test the patch and to highlight to a committer that you'd like the patch to be committed. Alternatively, you could work with the community on the mailing list (referring to this jira/patch); that happens sometimes when people are in the very early stages (say the patch doesn't even compile yet).

        Vikas Mehta added a comment -

        Is it ok for me to submit this patch?

        Thanks,
        Vikas

        Vikas Mehta added a comment -

        Hi Patrick,

        Sorry for the late response. Let me know if you were looking for something different than what I answered:

        1) in your testing what is the impact of triggering a large number of watches on overall operation latency?

        [vikas] Without this change, with a large number of watches, ZooKeeper would run out of memory storing all the watches. With this change, it is now bound by network bandwidth. One difference from the old implementation is that triggerWatch() now loops through all the watchers to find the watches it needs to trigger, instead of using the reverse map the previous version kept to avoid this scan; the reverse map only gives a slight benefit when the number of watches per path is much smaller than the number of watchers.
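
        Roughly, the scan described here looks like the following; this is only a sketch, and the field and method names are not the ones in the patch:

        import java.util.ArrayList;
        import java.util.BitSet;
        import java.util.List;
        import java.util.Map;

        // Sketch of the triggerWatch() scan described above: with only per-watcher
        // bitmaps (no reverse map), every watcher must be checked on each trigger.
        class TriggerScanSketch {
            static List<Long> sessionsToNotify(Map<Long, BitSet> watchesBySession, int triggeredPathId) {
                List<Long> toNotify = new ArrayList<Long>();
                for (Map.Entry<Long, BitSet> e : watchesBySession.entrySet()) {
                    if (e.getValue().get(triggeredPathId)) {    // O(#watchers) per trigger
                        e.getValue().clear(triggeredPathId);    // watches are one-shot
                        toNotify.add(e.getKey());
                    }
                }
                return toNotify;
            }
        }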

        2) Say I delete a znode in your example, that will trigger 10k notifications to be sent (one to each session) - what is the impact on the latency of this request (the delete), both with and without this patch?

        [vikas] Without the patch, as mentioned above, we are not able to run ZooKeeper with so many watches. If we do not have too many watchers (or watches overall), the impact of this change would be a linear scan of the watchers to identify the watches that need to be triggered for the update/delete node operation.

        3) Subsequent to the investigations you've been doing, should we have concerns on overall service availability due to large numbers of watches being triggered concurrently?

        [vikas] We are thinking of implementing some throttling on the server (and later maybe on the client side as well) to prevent deterioration in ZooKeeper performance or availability.

        Thanks,
        Vikas

        Patrick Hunt added a comment -

        This is awesome work guys, thanks.

        Question: in your testing what is the impact of triggering a large number of watches on overall operation latency?

        Each client fetches and puts watches on all the znodes.

        Say I delete a znode in your example, that will trigger 10k notifications to be sent (one to each session) - what is the impact on the latency of this request (the delete), both with and without this patch?

        Subsequent to the investigations you've been doing, should we have concerns on overall service availability due to large numbers of watches being triggered concurrently?

        Vikas Mehta added a comment -

        Created the reviewboard review:

        https://reviews.apache.org/r/2399/

        Also, I removed the unit test HACK comment, because it doesn't affect the test or the WatchManager changes in this case. I was a bit unhappy because in ZooKeeper you would always enter this function holding a read lock, but the unit test doesn't (and it is not required to hold the read lock for this particular test).

        Camille Fournier added a comment -

        This one probably deserves a reviewboard review, if you would please create one. At a glance, my eyebrows raise at comments like: // HACK: read the pathID without holding the read lock

        Vikas Mehta made changes -
        Field Original Value New Value
        Attachment ZooKeeper.patch [ 12499063 ]
        Vikas Mehta added a comment -

        Please review the attached patch to optimize the memory needed by WatchManager. This patch also optimizes the locking in WatchManager.

        With this optimization, memory consumed by WatchManager is under 100 MB for 200 million watches.

        Vishal Kathuria added a comment -

        Thanks for reviewing and for your comments, Mahadev. I was leaning towards 1 as well.

        There is one issue I am worried about if we use czxid as the index into the bitmap - it will keep growing over time, as recently created nodes will have the latest zxid as the value. The bitmap will become sparse and we will need some sort of bitmap compression scheme.

        The bitmap compression would increase the cost of bit lookup.

        I had another thought about allocating an id for a znode without causing a change in the storage format.

        We could allocate the node id for a znode as we are deserializing the data tree during restore at startup time. All these ids would be sequential so the max value of the node id would be the same as the total number of nodes.

        For new nodes created after startup, we can keep allocating node ids using the same counter.

        The only other thing we will have to design for is reusing node ids as the znodes are deleted and created - this should be easy by keeping a bitmap of free node id values.
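
        A minimal sketch of the allocator described above, assuming sequential ids assigned during deserialization and a free-id bitmap for reuse; the names are illustrative:

        import java.util.BitSet;

        // Illustrative sketch: sequential node-id allocation with a free-id bitmap
        // so ids from deleted znodes can be reused and the id space stays dense.
        class NodeIdAllocator {
            private final BitSet freeIds = new BitSet();
            private int nextId = 0;

            synchronized int allocate() {
                int reused = freeIds.nextSetBit(0);
                if (reused >= 0) {            // reuse an id freed by a deleted znode
                    freeIds.clear(reused);
                    return reused;
                }
                return nextId++;              // otherwise hand out the next sequential id
            }

            synchronized void free(int id) {  // called when the znode is deleted
                freeIds.set(id);
            }
        }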

        Thanks

        Mahadev konar added a comment -

        Vishal,
        I like idea 1). Each znode has a txn id with which it is created, which is in any case unique for each znode (look at czxid in stat). We can use this to uniquely identify the znode. Doing 1) would be pretty easy.

        Vishal Kathuria created issue -

          People

          • Assignee: Thawan Kooburat
          • Reporter: Vishal Kathuria
          • Votes: 0
          • Watchers: 13
