CASSANDRA-6488

Batchlog writes consume unnecessarily large amounts of CPU on vnodes clusters

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.13, 2.0.4
    • Component/s: None
    • Labels:
      None

      Description

      The cloneOnlyTokenMap() call in StorageProxy.getBatchlogEndpoints causes enormous amounts of CPU to be consumed on clusters with many vnodes. I created a patch to cache this data as a workaround and deployed it to a production cluster with 15,000 tokens. CPU consumption dropped to 1/5th of its previous level. This highlights the overall issue with cloneOnlyTokenMap() calls on vnodes clusters. I'm including the maybe-not-the-best-quality workaround patch as a reference, but cloneOnlyTokenMap() is a systemic issue, and every place it's called should probably be investigated.
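
The cost pattern described above can be illustrated with a minimal sketch. This is not the attached patch: `TokenRing`, `CachedRingView`, and the `version` counter are hypothetical names, and the sketch assumes the ring bumps a version number whenever its tokens change, so that a call site can reuse one copy until the version moves.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical stand-in for a token ring; not Cassandra's TokenMetadata. */
final class TokenRing {
    private final TreeMap<Long, String> tokenToEndpoint = new TreeMap<>();
    private long version = 0;
    int copiesMade = 0; // demonstration counter only

    synchronized void addToken(long token, String endpoint) {
        tokenToEndpoint.put(token, endpoint);
        version++; // assumption: every ring change bumps the version
    }

    synchronized long version() {
        return version;
    }

    /** Copies every token: O(n) per call, costly with thousands of vnode tokens. */
    synchronized SortedMap<Long, String> copyTokens() {
        copiesMade++;
        return new TreeMap<>(tokenToEndpoint);
    }
}

/** Call-site cache: re-copy only when the ring version has changed. */
final class CachedRingView {
    private final TokenRing ring;
    private SortedMap<Long, String> snapshot;
    private long snapshotVersion = -1;

    CachedRingView(TokenRing ring) {
        this.ring = ring;
    }

    synchronized SortedMap<Long, String> get() {
        long current = ring.version();
        if (snapshot == null || snapshotVersion != current) {
            // Read-only wrapper, because the same snapshot is handed to every caller.
            snapshot = Collections.unmodifiableSortedMap(ring.copyTokens());
            snapshotVersion = current;
        }
        return snapshot;
    }
}
```

With this shape, repeated reads between ring changes share a single copy, which is the effect the workaround is after; the trade-off is that callers must treat the returned map as read-only.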

      1. 6488-fix.txt
        4 kB
        Aleksey Yeschenko
      2. 6488-rbranson-patch.txt
        5 kB
        Rick Branson
      3. 6488-v2.txt
        6 kB
        Jonathan Ellis
      4. 6488-v3.txt
        10 kB
        Aleksey Yeschenko
      5. graph (21).png
        41 kB
        Rick Branson

        Activity

        Rick Branson added a comment -

        CPU usage dropping on a production cluster after the attached patch is rolled out.

        Jonathan Ellis added a comment -

        v2 to move the caching logic inside cloneOnlyTokenMap

        Jonathan Ellis added a comment -

        NB: I'm not sure what the changes to candidates/chosenEndpoints do so I've left that out for now.

        Aleksey Yeschenko added a comment -

        v3 merges both and has some minor (stylistic) changes to SP on top.

        Aleksey Yeschenko added a comment -

        Committed in 4be9e6720d9f94a83aa42153c3e71ae1e557d2d9.

        Michael Shuler added a comment -

        This introduced a failure in BootStrapperTest:

        test:
             [echo] running unit tests
            [mkdir] Created dir: /home/mshuler/git/cassandra/build/test/cassandra
            [mkdir] Created dir: /home/mshuler/git/cassandra/build/test/output
            [junit] WARNING: multiple versions of ant detected in path for junit 
            [junit]          jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
            [junit]      and jar:file:/home/mshuler/git/cassandra/build/lib/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
            [junit] Testsuite: org.apache.cassandra.dht.BootStrapperTest
            [junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 6.177 sec
            [junit] 
            [junit] ------------- Standard Error -----------------
            [junit]  WARN 09:47:46,135 No host ID found, created 9019bb70-4d6e-4cf6-b730-140ff5ae4be5 (Note: This should happen exactly once per node).
            [junit]  WARN 09:47:46,262 Generated random token [d9180feb2e806704effa4024e8f4c631]. Random tokens will result in an unbalanced ring; see http://wiki.apache.org/cassandra/Operations
            [junit] ------------- ---------------- ---------------
            [junit] Testcase: testSourceTargetComputation(org.apache.cassandra.dht.BootStrapperTest):   FAILED
            [junit] expected:<1> but was:<0>
            [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
            [junit]     at org.apache.cassandra.dht.BootStrapperTest.testSourceTargetComputation(BootStrapperTest.java:212)
            [junit]     at org.apache.cassandra.dht.BootStrapperTest.testSourceTargetComputation(BootStrapperTest.java:173)
            [junit] 
            [junit] 
            [junit] Test org.apache.cassandra.dht.BootStrapperTest FAILED
        
        BUILD FAILED
        /home/mshuler/git/cassandra/build.xml:1113: The following error occurred while executing this line:
        /home/mshuler/git/cassandra/build.xml:1078: Some unit test(s) failed.
        
        Total time: 9 seconds
        ((4be9e67...)|BISECTING)mshuler@hana:~/git/cassandra$ git bisect bad
        4be9e6720d9f94a83aa42153c3e71ae1e557d2d9 is the first bad commit
        commit 4be9e6720d9f94a83aa42153c3e71ae1e557d2d9
        Author: Aleksey Yeschenko <aleksey@apache.org>
        Date:   Sun Dec 15 13:29:56 2013 +0300
        
            Improve batchlog write performance with vnodes
            
            patch by Jonathan Ellis and Rick Branson; reviewed by Aleksey Yeschenko
            for CASSANDRA-6488
        
        :100644 100644 e5865925f160faabc2506c3a5aac9985c17c1658 b55393b2ed138011bab52f95f2e9b52107709938 M      CHANGES.txt
        :040000 040000 dea10aa8044e10eb60002e75f2586a9c8e94b647 7030c09f9713bd3e342e4e012c59b09c86b79a42 M      src
        
        Michael Shuler added a comment -

        I'm working on the cassandra-2.0 branch, since I didn't mention it above. Around the same time, LeaveAndBootstrapTest, MoveTest, and RelocateTest became new failures; I'm looking at those:
        http://cassci.datastax.com/job/cassandra-2.0_test/49/console

        Aleksey Yeschenko added a comment -

        So, the caching part. Jonathan Ellis can you have a look? If not, I will, later, but it's potentially 1.2.13 vote-affecting.

        Michael Shuler added a comment -

        Commit bb09d3c fully passed all the unit tests in the cassandra-2.0 branch:
        http://cassci.datastax.com/job/cassandra-2.0_test/47/console

        Michael Shuler added a comment (edited) -

        Those same tests also look like new failures with this commit in the cassandra-1.2 branch:
        http://cassci.datastax.com/job/cassandra-1.2_test/32/console vs. http://cassci.datastax.com/job/cassandra-1.2_test/33/console

        (edit for clarity) New unit test failures in c-2.0 and c-1.2 branches with this commit:

        • BootStrapperTest
        • LeaveAndBootstrapTest
        • MoveTest
        • RelocateTest

        Aleksey Yeschenko added a comment -

        Separates TM.cloneOnlyTokenMap() and TM.cachedOnlyTokenMap(), and switches only SP.getBatchlogEndpoints() and ARS.getNaturalEndpoints() to use the cached version.

        They aren't the only methods that don't mutate the returned metadata, but going through the rest of the usages and optimizing those can wait.

        Also fixes a regression from 6435 in TM.cachedOnlyTokenMap().
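
The split between a cloning accessor and a cached accessor can be sketched as follows. This is an illustration under assumptions, not the committed code: `RingMetadata` and `updateToken` are hypothetical, and only the two accessor names are taken from the comment above. The idea shown is that callers which mutate the result get a private copy, while read-only callers share one snapshot that is dropped whenever the ring changes.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical ring metadata; the method bodies are assumptions for illustration. */
final class RingMetadata {
    private final TreeMap<Long, String> tokenToEndpoint = new TreeMap<>();
    private final AtomicReference<SortedMap<Long, String>> cached = new AtomicReference<>();

    synchronized void updateToken(long token, String endpoint) {
        tokenToEndpoint.put(token, endpoint);
        cached.set(null); // any ring change invalidates the shared snapshot
    }

    /** Fresh private copy on every call; the caller is free to mutate it. */
    synchronized TreeMap<Long, String> cloneOnlyTokenMap() {
        return new TreeMap<>(tokenToEndpoint);
    }

    /** Shared read-only snapshot, rebuilt lazily after an invalidation. */
    SortedMap<Long, String> cachedOnlyTokenMap() {
        SortedMap<Long, String> snap = cached.get();
        if (snap != null) {
            return snap; // fast path: no copy and no lock
        }
        synchronized (this) {
            snap = cached.get();
            if (snap == null) {
                snap = Collections.unmodifiableSortedMap(new TreeMap<>(tokenToEndpoint));
                cached.set(snap);
            }
            return snap;
        }
    }
}
```

Keeping both accessors means code that edits its copy keeps working unchanged, while hot read-only paths avoid the per-call copy.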

        Jonathan Ellis added a comment -

        updated comments and committed

        Michael Shuler added a comment (edited) -

        cassandra-1.2 branch, commit 13348c4, is passing these 4 unit tests:
        http://cassci.datastax.com/job/cassandra-1.2_test/35/console

        cassandra-2.0 is passing these, also:
        http://cassci.datastax.com/job/cassandra-2.0_test/50/console

        Thanks all!

          People

          • Assignee: Rick Branson
          • Reporter: Rick Branson
          • Reviewer: Aleksey Yeschenko
          • Votes: 0
          • Watchers: 4
