Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 3.0.0 rc2
    • Component/s: Configuration
    • Labels:
      None

      Description

      Reference discussion on CASSANDRA-7486.

      For smaller heap sizes G1 appears to have some throughput/latency issues when compared to CMS. With our default max heap size at 8G on 3.0, there's a strong argument to be made for having CMS as the default for the 3.0 release.

        Issue Links

          Activity

          sivann Spiros Ioannou added a comment -

          Stefano Ortolani We reverted to G1 and everything has been fine since: no performance hit, and all repairs work fine.

          ostefano Stefano Ortolani added a comment -

          Just a note here: even with an 8G heap, some GC pauses can take very long (8-CPU machine, 2 TB data size). While a typical pause is about 220 ms, it can take > 20 sec.

          Not by this much, but I have seen similar pauses (~10 sec) on a 12-CPU machine with 200 GB of data, LCS, and a 50 MB max partition size, while doing incremental repairs (RF=3; the pause was hitting all three nodes). Currently exploring migrating from CMS to G1 for this reason.

          sivann Spiros Ioannou added a comment -

          Just a note here: even with an 8G heap, some GC pauses can take very long (8-CPU machine, 2 TB data size). While a typical pause is about 220 ms, it can take > 20 sec.

          WARN [Service Thread] 2017-05-02 10:44:08,772 GCInspector.java:282 - ConcurrentMarkSweep GC in 20335ms. CMS Old Gen: 4251698688 -> 2490135992; Par Eden Space: 671088640 -> 0; ...

          That happens especially during repairs; it makes other nodes drop the one doing the long GC, and causes almost all repairs to fail. Using G1 seems like a valid option for production.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          Sounds good. Committed as 1415fa512a21b933f89f8ff25b3fd12cfbbbf4cb

          mshuler Michael Shuler added a comment -

          also see CASSANDRA-10251 and CASSANDRA-10212

          pauloricardomg Paulo Motta added a comment - - edited

          For some reason, cassandra-env.sh is called before loading debian/cassandra.in.sh in debian/init. Since the init script already defines CASSANDRA_HOME, I assume it's safe to also define CASSANDRA_CONF in a similar way. Another option would be to load cassandra.in.sh before sourcing cassandra-env.sh, but I don't think that should be necessary, because we're dealing with pre-defined package directories (e.g. /usr/share/cassandra and /etc/cassandra), so it should be safe to define them statically. It's always better to confirm, though.
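
          For illustration only, a minimal sketch of the kind of static definition being discussed for debian/init (CONFDIR is assumed to already point at /etc/cassandra in that script; this is not the committed patch):

            CASSANDRA_HOME=/usr/share/cassandra    # already defined by the init script
            CONFDIR=/etc/cassandra                 # pre-defined package directory
            CASSANDRA_CONF=$CONFDIR                # define CASSANDRA_CONF the same static way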

          zznate Nate McCall added a comment -

          That would be Brandon Williams if he is willing to admit such.

          CASSANDRA_CONF is already set in debian/cassandra.in.sh though, so it should be redundant.

          JoshuaMcKenzie Joshua McKenzie added a comment -
          debian/init
          +CASSANDRA_CONF=$CONFDIR
          

          I'm wary of this change - are there ramifications outside just the scope of the jvm.options file to us changing debian/init in this way?

          Michael Shuler: Are you the one that's most knowledgeable about the debian init/install scripts? If so, what are your thoughts?

          pauloricardomg Paulo Motta added a comment -

          Are we intending to drop jvm.options into /etc rather than /etc/cassandra? That seems off to me.

          That was a mistake during the manual reconstruction of the nuked commit; the correct path is /etc/cassandra/jvm.options. Updated the 10403 branch with the correction. Sorry for the confusion!

          During a packaged-install test I found that the hints directory was not created correctly, so I created CASSANDRA-10525 to address that. After fixing that, I switched from CMS to G1 in /etc/cassandra/jvm.options and restarted the service; the options were reloaded correctly, so it seems to work well.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          So after working through a force-push nuking the .ps1 change on your branch (addressed offline), my last question:

          ...
          conf/logback.xml etc/cassandra
          conf/logback-tools.xml etc/cassandra
          conf/jvm.options etc/jvm.options
          ...
          

          Are we intending to drop jvm.options into /etc rather than /etc/cassandra? That seems off to me. I haven't tested the cassandra.install changes - I assume you tested this and it's working?

          pauloricardomg Paulo Motta added a comment -

          the GC logging options in jvm.options won't work if users un-comment those lines as-is.

          My bad, I forgot to convert the GC logging options from the old format to the new format; they should work now if uncommented. Thanks for spotting that!
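
          For reference, a sketch of what that conversion amounts to (the flags here are common HotSpot GC-logging options used as illustration, not the exact list from the patch). The old cassandra-env.sh style appended each option to JVM_OPTS:

            # old cassandra-env.sh style
            # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
            # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"

          while jvm.options expects one bare option per line, commented out by default:

            #-XX:+PrintGCDetails
            #-XX:+PrintGCDateStamps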

          Also added conf/jvm.options file to debian/cassandra.install

          JoshuaMcKenzie Joshua McKenzie added a comment - - edited

          Sorry for the delay - thought we were waiting on the answer re: package install.

          .ps1 changes look good and test correctly. Only one thing left: the GC logging options in jvm.options won't work on Windows if users un-comment those lines as-is. We should probably have a little logic in cassandra-env.* to parse out GC logging options and a simple flag to enable it in the jvm.options file. This shouldn't necessitate a change to the regex/StartsWith checks since the lines start with JVM_OPTIONS, so it should be pretty clean.
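
          As a rough sketch of that idea (the marker name and the exact logging options are hypothetical, not from the patch), cassandra-env.sh could key GC logging off a single flag line in jvm.options:

            # hypothetical: enable GC logging when jvm.options contains a marker line
            if grep -q '^cassandra.enable-gc-logging=true' "$CASSANDRA_CONF/jvm.options"; then
                JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
                JVM_OPTS="$JVM_OPTS -Xloggc:$CASSANDRA_HOME/logs/gc.log"
            fi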

          Once that's ironed out, I'll commit. (edit: after the debian/cassandra.install changes too)

          tjake T Jake Luciani added a comment - - edited

          Do package installs need any tweaking to support the jvm.options file, or is anything in the /conf dir already supported?

          The new jvm.options file needs to be added to debian/cassandra.install

          jjirsa Jeff Jirsa added a comment -

          Being somewhat related to that anecdote (since it's my employer and I touch those clusters every day), I'll point out that the cluster described in that talk was tuned by Albert P Tobey, and when we tried to apply a similar config to our 2.0 cluster, we started seeing a ton of OOMs with 2.0.x + G1 + 8G heap. We eventually reverted, because we understood CMS better (already set up with the 8150 tunings), and we were able to mitigate the OOMs we saw with G1 on that 2.0 cluster by going back to CMS. I think the reality is that both CMS and G1 need to be tuned - in our case, we had tuned CMS on that 2.0 cluster. +1 to having G1 marked as experimental but easy to enable.

          jshook Jonathan Shook added a comment -

          Anecdote: https://www.youtube.com/watch?v=1R-mgOcOSd4&feature=youtu.be&t=24m27s

          pauloricardomg Paulo Motta added a comment -

          Thanks for the review! Follow-up below:

          • Updated branch with windows support to jvm.options file on cassandra-env.ps1
          • Removed legacy label from CMS settings on jvm.options file
          • .bat already had CMS settings there
          • I don't think it's necessary to have an alert/fail when no GC settings are present in the jvm.options file, since that's a valid option (the default GC settings are picked), and if users choose to edit the file manually to comment out the GC options, they had better know what they're doing.
          • Replaced G1 NEWS.txt entry with entry explaining gc options were moved to jvm.options file
          • Follow-up ticket created: CASSANDRA-10494

          Do package installs need any tweaking to support the jvm.options file, or is anything in the /conf dir already supported?

          JoshuaMcKenzie Joshua McKenzie added a comment -

          I like the changes; having our GC settings in their own file cleans up quite a bit.

          Some feedback:

          • CMS is labeled as legacy and G1 as experimental. I'd remove the "legacy mode" label from CMS since it's our recommended GC for 3.0 at this time
          • Consider checking whether users have un-commented both GC types and alert / fail to start on that configuration (a rough sketch follows this list).
          • Need to roll the .bat implementation back to CMS when you do the windows integration into the .ps1. I'm fine not pursuing making the .bat file parse the jvm.options file since we strongly recommend the .ps1 launch scripts and alert on lack of permissions to use them.
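
          A possible shape for that check, as a sketch only (run from cassandra-env.sh against the jvm.options file; not part of the current patch):

            # refuse to start if both collectors are un-commented in jvm.options
            if grep -q '^-XX:+UseConcMarkSweepGC' "$CASSANDRA_CONF/jvm.options" && \
               grep -q '^-XX:+UseG1GC' "$CASSANDRA_CONF/jvm.options"; then
                echo "Both CMS and G1 are enabled in jvm.options; enable only one GC." >&2
                exit 1
            fi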

          I'd like to see a follow-up ticket for 3.X where we unify all our JVM_OPTS in the jvm.options file, not just GC statics (assuming your comment about GC statics wasn't just a typo: if you mean JVM_OPTS statics I'm in complete agreement).

          Looking good.

          pauloricardomg Paulo Motta added a comment -

          It seems the 100x test failed again.

          Given rc2 will be out soon, and we don't have sufficient elements to ensure G1 is yet a good default GC for 3.0, I opted to play it safe and revert to our 2.2 CMS settings while making G1 easy to enable.

          While looking for a way to make it trivial to enable G1, I noticed that although we're pretty accustomed to cassandra-env.sh, it's not very user-friendly to require users to modify a script to change GC/JVM options (is it a script, or a configuration file?). Furthermore, JVM options are duplicated across cassandra-env.sh and cassandra-env.ps1, so every change to our default JVM options requires modifying both files, which could lead to options being updated in one place and not the other.

          I thought this could be a good opportunity to extract the JVM options into their own portable, platform-independent configuration file, where users could easily tune JVM/GC options without having to maintain a copy of a script that could be incompatible between different Cassandra versions. While searching for options I found a simple format with basically one JVM option per line, which makes it easy to parse from both cassandra-env.sh and cassandra-env.ps1. In order not to go too far out of the scope of this ticket, I initially ported only the Heap and GC options, and will move the other static GC options in the context of another ticket, if there are no objections.

          So, given this jvm.options file, changing GCs is a matter of commenting out the default CMS settings and uncommenting the G1 settings:

          ### CMS Settings (legacy mode, enabled by default)
          -XX:+UseParNewGC
          -XX:+UseConcMarkSweepGC
          -XX:+CMSParallelRemarkEnabled
          -XX:SurvivorRatio=8
          -XX:MaxTenuringThreshold=1
          -XX:CMSInitiatingOccupancyFraction=75
          -XX:+UseCMSInitiatingOccupancyOnly
          -XX:CMSWaitDuration=10000
          -XX:+CMSParallelInitialMarkEnabled
          -XX:+CMSEdenChunksRecordAlways
          
          ### G1 Settings (experimental, comment previous section and uncomment section below to enable)
          ## Use the Hotspot garbage-first collector.
          #-XX:+UseG1GC
          #
          ## Have the JVM do less remembered set work during STW, instead
          ## preferring concurrent GC. Reduces p99.9 latency.
          #-XX:G1RSetUpdatingPauseTimePercent=5
          #
           ## Main G1GC tunable: lowering the pause target will lower throughput and vice versa.
          ## 200ms is the JVM default and lowest viable setting
          ## 1000ms increases throughput. Keep it smaller than the timeouts in cassandra.yaml.
          #-XX:MaxGCPauseMillis=500
          

          The full version of the cassandra jvm.options file is available on this gist. I initially adapted only cassandra-env.sh to work with the new file. After the review, if there are no major changes, I will adapt cassandra-env.ps1 to work with the new jvm.options file.
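
          A minimal sketch of the kind of parsing this format enables (illustrative only; the actual implementation is in the linked branch): read jvm.options line by line, skip comments and blank lines, and append everything else to JVM_OPTS:

            JVM_OPTS_FILE="$CASSANDRA_CONF/jvm.options"
            if [ -r "$JVM_OPTS_FILE" ]; then
                while read -r opt; do
                    case "$opt" in
                        "#"*|"") ;;                       # skip comments and blank lines
                        *) JVM_OPTS="$JVM_OPTS $opt" ;;   # pass the option through verbatim
                    esac
                done < "$JVM_OPTS_FILE"
            fi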

          The patch is available for review here.

          Tests: testall, dtest

          jshook Jonathan Shook added a comment - - edited

          To simplify: implementing CASSANDRA-10425 is effectively the same as reverting for the systems that we have commonly tested, while allowing a likely better starting point for those where we have field experience with G1.

          jshook Jonathan Shook added a comment - - edited

          Joshua McKenzie
          I understand and appreciate the need to control scoping effort for 3.0 planning.

          Shouldn't the read/write workload distribution also play into that?

          Yes, but there is a mostly orthogonal effect to the nuances of the workload mix which has to do with the vertical scalability of GC when the system is more fully utilized. This is visible along the sizing spectrum. Run the same workload and try to scale the heap proportionally over the memory (1/4 or whatever) and you will likely see CMS suffer no matter what. This is slightly conjectural, but easily verifiable with some effort.

          the idea of having a default that's optimal for everyone is unrealistic

          I think we are converging on a common perspective on this.

          Sylvain Lebresne

          3.2 will come only 2 months after 3.0

          My preference would be to have CASSANDRA-10425 out of the gate, but this would still require some testing effort for safety. The reason is that 3.0 represents a reframing of performance expectations, and after that, any change to the defaults, even for larger-memory systems, carries a bigger chance of surprise. Do we have a chance to learn about sizing from surveys, etc., before the runway ends for 3.0?

          If we could get something like CASSANDRA-10425 in place, it would cover both bases.

          iamaleksey Aleksey Yeschenko added a comment -

          I agree that we should revert in 3.0.0 (while making it trivial to enable), and revisit for 3.2.

          slebresne Sylvain Lebresne added a comment -

          My 2 cents here: let's revert to CMS for 3.0 (while tweaking our startup script so it's easier to switch settings, as suggested above; that's a good idea and it should be trivial), since CMS has been our default forever and it's clear we haven't experimented enough yet to be 100% sure that G1 is a clearly better default, so we should err on the side of conservatism. We absolutely should continue to experiment on this and refine our defaults, and we now have CASSANDRA-10425 for that. But as far as 3.0 is concerned, I would venture that using our available resources for more testing (things like testing the performance of clusters during upgrade to 3.0, to take just one example) would be a better choice than spending them on more experimentation here. Even if G1 turns out to be an overall better default, 3.2 will come only 2 months after 3.0.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          I'd prefer not to make too many assumptions about confirmation or (human) memory bias on this. We will not get off this carousel without actual data.

          Fair point, and I agree on the fact that all this needs further exploration. Unfortunately we have neither infinite time nor resources to get ready for 3.0, so there's a reduction in scope as to what this ticket's trying to solve.

          if we simply align the GC settings to the type of hardware that they work well for.

          Shouldn't the read/write workload distribution also play into that?

          So after being a PITA and devil's advocate on this ticket, the end perspective I come down to is: there's a bunch of different workloads and a bunch of different hardware that C* runs on, and the idea of having a default that's optimal for everyone is unrealistic. It may very well be that G1 is a better "good enough" default for most distributions, large heap or no, and that's the conversation on IRC that led Jonathan to his comment on the other ticket to go with it.

          JEP 248 seems to imply that Oracle thinks that's the case.

          Within the scope of this ticket, pending the 100x results (if they're in line with the 10x implications), I'd be comfortable using this ticket as an opportunity to add back the 2.2 settings, commented out, with some extra context on the workloads we expect those settings to excel at, and keeping G1 the default.

          Paulo Motta: could you elaborate on some of the pain points you ran into with an 8G heap and G1?

          jshook Jonathan Shook added a comment -

          I created CASSANDRA-10425 to discuss the per-size defaults.

          jshook Jonathan Shook added a comment -

          Joshua McKenzie I'd prefer not to make too many assumptions about confirmation or (human) memory bias on this. We will not get off this carousel without actual data. However, to the degree that you are right about it, it should encourage us to explore further, not less. CMS's pain in those cases has much to do with its inability to scale with hardware sizing and concurrency trends, which we seem to be working really hard to disregard. Until someone puts together a view of current and emerging system parameters, we really don't have the data that we need to set a default.

          I posit that the general-case system is much bigger in practice than in the past. I also posit that on those systems, G1 is an obviously better default than CMS. So, we are likely going to get some data on 1) what the hardware looks like in the field and 2) whether or not we can demonstrate, with actual workloads on current system profiles, the improvements over CMS with larger memory that we've seen. I'm simply eager to see more data at this point.

          This is a bit out of scope for the ticket, but it is important. If we were able to set a default depending on the available memory, there would not be a single default. Trying to scale GC bandwidth up on bigger metal with CMS is arguably more painful than trying to make G1 usable with less memory. However, we don't have to make that bargain as either-or. We can have the best of both if we simply align the GC settings to the type of hardware they work well for.

          I'll create another ticket for that.

          enigmacurry Ryan McGuire added a comment - - edited

          We enabled the use of different stress branches, so we should be able to run it now.

          I'll restart the job that benedict ran: http://cstar.datastax.com/tests/id/01b714d8-5f32-11e5-a4e4-42010af0688f

          I'll use Benedict's stress branch from: https://github.com/belliottsmith/cassandra/tree/stress-report-interval

          EDIT: test started: http://cstar.datastax.com/tests/id/ac36457a-67a0-11e5-8666-42010af0688f

          JoshuaMcKenzie Joshua McKenzie added a comment - - edited

          The debate here isn't between "Do we consider G1 or disregard G1 entirely as the general case", but rather "Do we use CMS or G1 as our default GC for 3.0, keeping the settings for the alternative available in our configuration files for users to swap between easily".

          Edit: missed the "general case" phrase. To that point, I haven't heard of actual data showing what the general case is for C* usage is in the wild regarding heap size. I would expect that large heap deployments will be disproportionately reflected in the memories of people in the field since CMS causes pain at that point.

          Ryan McGuire: did we ever get that 100x test run to complete successfully? Before we determine whether Paulo should go fully down this rabbit-hole pre 3.0 releasing, I'd like that data point since we invested some effort into getting that running already.

          Edit:

          jshook Jonathan Shook added a comment -

          Paulo Motta I understand, with your updated comment.
          For systems that can't support a larger heap, CMS is fine, as long as you don't mind saturating the survivor space and triggering the cascade of GC-induced side effects. Still, this is a performance trade-off against resiliency.

          I want to be clear that I think it would be a loss for us to just disregard G1 for larger memory systems as the general case. There seems to be some tension between the actual field experience and prognostication as to how it should work. I would like for data to lead the way on this, as it should.

          jshook Jonathan Shook added a comment -

          So, just to be clear: are we disregarding G1 for systems with larger memory, on the assumption that 8GB is all you'll ever need for "all but the most write-heavy workloads", even for systems that have more memory?

          jshook Jonathan Shook added a comment -

          To be clear, in some cases, we found G1 to be a better production GC, and those tests simply allowed us to verify this before leaving it in place.

          jshook Jonathan Shook added a comment -

          This statement carries certain assumptions about the whole system, which may not be fair across the board. For example, buffer cache is a critical consideration, but to a varying degree depending on how cache-friendly the workload is. Further, the storage subsystem determines a very large part of how much of a cache-miss penalty there is. So, prioritizing the cache at the expense of the heap is not a sure win. Often it is not the right balance.

          With systems that have high concurrency, it is possible to scale up the performance of the node as long as you can provide reasonable tunings to effectively take advantage of the available resources without critically bottle-necking on one. For example, with systems that have higher effective IO concurrency and IO bandwidth across many devices, you actually need higher GC throughput in order to match the overall IO capacity of the system, from the storage subsystem all the way to the network stack.

          This rationale has been evidenced in the field when we have made tuning improvements with G1 on certain systems as an opportunistic test. My explanation above is probably a gross oversimplification, but it reflects experience addressing GC throughput (and pauses, and phi, and hints, and load shifting, etc.) issues.

          pauloricardomg Paulo Motta added a comment - - edited

          I will hold back the experiments a bit and work on some other tickets until we decide whether it's worth comparing CMS vs G1 for small heap sizes in the context of this ticket, or whether we can just decide based on discussion.

          edit: seems I pressed enter without completing the message before, sorry for that.

          benedict Benedict added a comment -

          For context, that statement was based on a system with 32-64GB of memory.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          From Benedict's comment on CASSANDRA-7486:

          for anything but the most write-heavy workloads an 8Gb heap is probably what you'll want to ensure the file cache can make meaningful contributions to performance.

          Jonathan Shook / Pavel Yaskevich: Does your experience running C* in production back up that assertion? It makes theoretical sense to me and would change the weighing of large heap vs. small tuning in the JVM / our selection of defaults if so.

          jshook Jonathan Shook added a comment - - edited

          Note about memory sizes. Everything I wrote above assumes that we are talking about smaller heaps. Things clearly change when we go up in heap size beyond what CMS can handle well. (for those reading from the middle)

          pauloricardomg Paulo Motta added a comment -

          Interesting, thanks for your perspective.

          The reason I opted for a custom test on EC2, rather than using cstar (which only gives us raw performance results), is to test different workloads and check performance consistency over time with both GCs using the default settings. I will also install OpsCenter to gather some metrics as the test progresses, so if there is a particular metric you'd be interested in seeing, please let me know and I'll make sure to enable it.

          jshook Jonathan Shook added a comment -

          I do think it is valid, however I expect the findings to be slightly different. The promise of G1 on smaller systems is more robust performance across a range of workloads without manual tuning. That said, it probably won't perform as well in terms of ops/s, etc. The question to me is really whether we are trying to save people from the pain of not going fast enough or whether we are trying to save them from the pain of a CMS once they start having cascading IO and heap pressure through the system. I am very curious about our tests proving this out as we would expect.

          As an operator and a developer, I'd take an easily tuned and stable setting over one that goes fast until it doesn't go, any day. However, some will have already adjusted their cluster sizing around one expectation, so we'd want to make sure to avoid surprises. With 3.0 having other changes as well to offset, it might be a wash.

          Raw performance is only part of the picture. I would like to see your results, for sure.

          pauloricardomg Paulo Motta added a comment -

          So, tests that go up to 32G of heap on a system with 64GB of main memory are really where the proof points are. Saturating loads are good too.

          As far as I understood, the purpose of this ticket is to evaluate whether it makes sense to ship 3.0 with G1 (considering a default heap size of 8GB), so we should limit the heap experiments to an 8GB heap. If this understanding is correct, it's out of the scope of this ticket to re-evaluate the default heap size or to provide dynamic heap/GC settings based on the available hardware (although it would be nice to perform more thorough experiments and add support for dynamic profiles in a separate ticket after GA). For this reason I opted for m1.xlarge instances, which are the equivalent of an m3.xlarge (the smallest instance someone would deploy C* on in production) but with spinning disks, to create additional heap pressure due to I/O exhaustion.

          We are quite confident G1 is better for heaps > 24GB, but the question we want to answer on this ticket is: is G1 the best option for a default 8GB heap? Given this, do you think it's still valid to perform these experiments on an m1.xlarge?

          jshook Jonathan Shook added a comment -

          To be fair, m1.xlarge has less than 16GB of RAM, which is still on the small side for G1 effectiveness, although at some point between 14G and 24G you should start seeing G1 provide more stability than CMS for GC-saturating loads (assuming you don't set the GC pause target down too low).
          G1 should start to be the obvious choice when you run with more than about 24GB of heap, and even more obviously with 32GB. This might seem large, but if you look at what businesses tend to deploy in data centers for bare metal, they aren't just 32GB systems anymore. You'll often see 64, 128, or more GB of DRAM. There are some other EC2 profiles which get up into this range, but they are disproportionately more expensive.

          So, tests that go up to 32G of heap on a system with 64GB of main memory are really where the proof points are. Saturating loads are good too.

          pauloricardomg Paulo Motta added a comment -

          I will perform some saturation tests on trunk comparing default CMS settings (from 2.2) vs default G1 settings (from 3.0) with read, write and mixed workloads with the latest oracle jvm8. I will also perform operations such as repairs and bootstraps during the benchmarks in order to simulate a production-ish environment.
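
          For a sense of the shape of such a run (the sizes, durations and thread counts below are illustrative, not the actual test plan), the workloads could be driven with cassandra-stress along these lines:

            # write, read and mixed saturation runs (illustrative parameters)
            cassandra-stress write n=50000000 -rate threads=200
            cassandra-stress read duration=2h -rate threads=200
            cassandra-stress mixed ratio\(write=1,read=3\) duration=2h -rate threads=200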

          Since the previous tests were done on SSD boxes, I will start the experiments with m1.xlarge spindle instances in order to exhaust disk capacity more easily and generate additional heap pressure, so things will get more insane. What do you think Joshua McKenzie? Please let me know if this sounds good, or if any additional scenarios should be covered.

          If there are clear impacts on these experiments, then we can consider rolling back and provide commented-out G1 options for larger heap sizes. Otherwise, we might as well just leave G1 enabled by default, since there was already a lot of discussion (as well as experiments on larger heap sizes, non-default GC settings and steady-state workloads) on CASSANDRA-7486.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          Adding extra configuration files w/options to switch on launch is something I'd be comfortable with us adding after GA so long as we leave our default alone. For this ticket, let's focus on just determining whether or not we feel reverting from G1 to CMS is appropriate for 3.0, and then move forward on a separate ticket for adding more intelligence to our GC configuration sourcing options.

          For the record and my .02, I quite like the idea of us having multiple GC profiles out of the box with either logic to switch based on available heap, or via command-line for different expected workloads for instance; I think there's a lot we could do there to make operators' lives easier.

          Ryan McGuire: Any update on how that 100x test went?

          jshook Jonathan Shook added a comment -

          I would be entirely in favor of having a separate settings file that can simply be sourced in. Having several related GC options sprinkled through the -env file is bothersome. This should apply as well to the CMS settings. Perhaps it should even be a soft setting, as long as the possible values are marshaled against any injection.

          pauloricardomg Paulo Motta added a comment -

          +1. As an operator I had some issues with an 8GB heap and G1GC. We should probably make it easy to switch by extracting the GC properties to a variable, providing a commented-out option with pre-filled G1 settings, and maybe mentioning something in the documentation too.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          To me, the long-term solution of C* having the intelligence to select G1 for heaps over X, CMS for heaps under X makes a lot of sense, assuming test data shows that to be the appropriate solution.
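
          As a rough illustration of that idea (the threshold and variable names below are hypothetical, not a committed design), the launch script could branch on the configured heap size:

            # hypothetical heap-based selection; MAX_HEAP_SIZE_IN_MB is assumed to be computed earlier
            HEAP_GC_SWITCH_MB=20480   # illustrative cut-over point (~20GB)
            if [ "$MAX_HEAP_SIZE_IN_MB" -ge "$HEAP_GC_SWITCH_MB" ]; then
                JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
            else
                JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
            fi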

          I'd argue that what we need to do here is figure out what the sanest recommendation is for a default GC on 3.0, get that setup in our launch scripts (if necessary), and probably include the alternate set of GC configurations in our launch files, commented out, so people can easily swap back and forth based on their needs.

          jshook Jonathan Shook added a comment -

          Can we get some G1 tests with a 24+G heap to see if it's worth making this machine-specific? The notion of "commodity" changes with time. The settings need to adapt if possible.


            People

            • Assignee:
              pauloricardomg Paulo Motta
              Reporter:
              JoshuaMcKenzie Joshua McKenzie
              Reviewer:
              Joshua McKenzie
            • Votes:
              0
              Watchers:
              21

              Dates

              • Created:
                Updated:
                Resolved:

                Development