Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Configuration
    • Labels:
      None

      Description

      It's been found that the old Twitter recommendation of 100MB per core up to 800MB is harmful and should no longer be used.

      Instead, the formula should be 1/3 or 1/4 of max heap, with a cap of 2G. Whether 1/3 or 1/4 is debatable and I'm open to suggestions. If I were to hazard a guess, 1/3 is probably better for releases greater than 2.1.
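      As a rough sketch of what the calculation could look like in cassandra-env.sh (the fraction, the 2G cap, and the variable names below are illustrative, not a final patch):

      # illustrative only: new gen = 1/4 of max heap, capped at 2G
      max_heap_in_mb=8192                             # whatever MAX_HEAP_SIZE works out to, in MB
      desired_yg_in_mb=$((max_heap_in_mb / 4))
      if [ "$desired_yg_in_mb" -gt 2048 ]; then
          desired_yg_in_mb=2048
      fi
      HEAP_NEWSIZE="${desired_yg_in_mb}M"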

      Attachments

      1. upload.png (144 kB) - Matt Stump

        Activity

        Brandon Williams added a comment -

        We already use 1/4 since 2.0.

        Matt Stump added a comment - edited

        That's not quite true:
        https://github.com/apache/cassandra/blob/trunk/conf/cassandra-env.sh#L75-L77

        We'll use min((100MB * cores), (1/4 * max_heap)).

        Matt Stump added a comment -

        I'm going to advocate strongly for eden being 40-50% of the heap by default.

        Additionally, I'm going to advocate that we change the following:

        • Increase the ceiling on MAX_HEAP to 20G.
        • Increase MaxTenuringThreshold to 6 or 8.
        • Include the Instagram CMS enhancements.
        • Increase thread/core affinity for GC.
        • Set -XX:ParallelGCThreads and -XX:ConcGCThreads to min(20, number of cores).
        • Possibly set -XX:MaxGCPauseMillis to 20ms, but I haven't really tested this one.

        Instagram CMS settings:
        JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
        JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
        JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"

        Thread/core affinity settings:
        JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
        JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
        JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
        JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"

        I've seen decreased GC pause frequency, lower latency, and increased throughput in customer installations using the above recommendations. This was observed with both read-heavy and balanced workloads.

        +Rick Branson +T Jake Luciani

        Matt Stump added a comment -

        I wanted to add more evidence and exposition for the above recommendations so that my argument can be better understood.

        Young generation GC is pretty simple. The young generation is broken down into three segments: eden and two survivor spaces, s0 and s1. All new objects are allocated in eden, and once eden reaches a size threshold a minor GC is triggered. Only one survivor space is active at a time; at the same time that GC for eden is triggered, a GC for the active survivor space is also triggered. All live objects from both eden and the active survivor space are copied to the other survivor space, and then eden and the previously active survivor space are wiped clean. Objects bounce between the two survivor spaces until MaxTenuringThreshold is hit (the default in C* is 1). Once an object survives MaxTenuringThreshold collections it's copied to the tenured space, which is governed by a different collector, in our case CMS, but it could just as easily be G1. This act of copying is called promotion. Promotion from young generation to tenured space is what takes a long time, so if you see long ParNew GC pauses it's because many objects are being promoted. You decrease ParNew collection times by decreasing promotion.

        What can cause many objects to be promoted? Objects that have survived both the initial eden collection and MaxTenuringThreshold further collections in the survivor space. The main tunables are the sizes of the various spaces in young gen and the MaxTenuringThreshold. Increasing the young generation space decreases the frequency at which we have to run GC, because more objects can accumulate before we reach 75% capacity. By increasing both the young generation and the MaxTenuringThreshold you give short-lived objects more time to die, and dead objects don't get promoted.
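        To make those knobs concrete, they map to the following lines in cassandra-env.sh; the values here are only illustrative, not the defaults I'm proposing:

        MAX_HEAP_SIZE="8G"
        HEAP_NEWSIZE="4G"                                  # young gen: eden plus the two survivor spaces (-Xmn)
        JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"           # eden is 8x the size of each survivor space
        JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=6"    # young collections survived before promotion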

        The vast majority of objects in C* are ephemeral, short-lived objects. The only things that should live in tenured space are the key cache and, in releases < 2.1, memtables. If most objects die in the survivor spaces you've solved the long GC pauses for both the young gen and tenured spaces.

        As a data point, on the mixed cluster where we've been experimenting with these options most aggressively, the longest CMS pause in a 24 hour period went from > 10s to less than 900ms, and most nodes saw a max of less than 500ms. That's just the max CMS pause, which could include an outlier like defragmentation; the average CMS pause is significantly less, under 100ms. For ParNew collections we went from many pauses in excess of 200ms to a cluster-wide max of 15ms and an average of 5ms. ParNew collection frequency decreased from one per second to one every 10 seconds in the worst case, with an average of one every 16 seconds.

        This also unlocks additional throughput on large machines. For 20-core machines I was able to increase throughput from 75k TPS to 110-120k TPS. For a 40-core machine we more than doubled request throughput and significantly increased compaction throughput.

        I've asked a number of other larger customers to help validate the new settings. I now view GC pauses as a mostly solvable issue.

        Jeremiah Jordan added a comment -

        While I agree all of that sounds nice for read-heavy workloads, have you used these settings with a write-heavy workload?

        In my experience, when you have a write-heavy workload your young gen fills up with memtable data, which will (and should) be promoted to old gen. So if you set your young gen size high, it takes forever to copy all of that to old gen. Increasing the MaxTenuringThreshold makes it even worse, as all of the memtable data has to get copied back and forth inside young gen X times, and even more memtable data builds up in the meantime, so the copy to old gen takes that much longer.

        Matt Stump added a comment -

        I don't disagree with your experience, but I do disagree with the description of what is happening. With the GC frequency that I described above, the memtable will be moved to tenured space after about 60-80 seconds. All of the individual requests will create ephemeral objects which would ideally be handled by ParNew.

        Where we went wrong was growing the heap without also increasing MaxTenuringThreshold. By default we set MaxTenuringThreshold to 1, which means everything that survives 2 GCs is promoted to tenured; coupled with a heap that's small for the workload, that results in a very high promotion rate, which is why we see the delays. The key is to always increase MaxTenuringThreshold and young gen more or less proportionally. From the perspective of GC and the creation rate of ephemeral objects, reads and writes are more or less identical. One could even make the case that writes are better suited to the settings I've outlined above, because writes should put less pressure on eden due to the simpler request path. In my opinion, and I hope to have data to back this up soon, write-heavy vs. read-heavy GC tuning is mostly a red herring.

        Matt Stump added a comment -

        Just to emphasize the point: I just got word of another, unrelated customer that rolled out the changes. Here is a graph of their GC activity. Additionally, write latency was cut in half.

        T Jake Luciani added a comment -

        Let's run some cstar tests with write and read workloads...

        T Jake Luciani added a comment -

        I also learned we should not be using biased locking. Here is a sample run showing 2.1 without and with biased locking disabled:

        http://cstar.datastax.com/graph?stats=0f0ec9a6-710c-11e4-af11-bc764e04482c&metric=op_rate&operation=2_read&smoothing=1&show_aggregates=true&xmin=0&xmax=98.89&ymin=0&ymax=273028.8

        -XX:-UseBiasedLocking
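        In cassandra-env.sh terms that would simply be:

        JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"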
        
        Liang Xie added a comment -

        IMHO, the MaxTenuringThreshold setting should be different per cluster, or rather per read/write pattern. In the past, when I tuned another similar NoSQL system in our internal production clusters, I found I needed to set it to a different value per cluster to get optimal behavior (e.g. 3 for one cluster, but 8 or something else for another).
        T Jake Luciani, totally agreed with your point! I had seen several long safepoints due to biased locking in a Hadoop system.

        Jonathan Ellis added a comment -

        T Jake Luciani are you going to run the tests or do you want to delegate to Ryan McGuire's team?

        T Jake Luciani added a comment -

        I can run it through some workloads...

        Oleg Anastasyev added a comment - edited

        Let me add my 2 cents: GC options from our production clusters which I believe could be useful for everyone. To improve CMS parallelization:

        -XX:+ParallelRefProcEnabled
        -XX:+CMSParallelInitialMarkEnabled

        I'd also propose adding -XX:+DisableExplicitGC, because the Java RMI runtime invokes a System.gc every hour; alternatively, increase sun.rmi.dgc.server.gcInterval to effectively infinity.
        We found that on Sun JVMs 7 and 8, CMSWaitDuration is not honored by CMS when GC is initiated by System.gc, and this was the reason for most long rescan phase pauses.
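        In cassandra-env.sh terms that would be something like this (the gcInterval value below is just Long.MAX_VALUE; use one approach or the other, not both):

        JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC"
        # or, alternatively, push the hourly RMI-triggered System.gc effectively to infinity:
        JVM_OPTS="$JVM_OPTS -Dsun.rmi.dgc.server.gcInterval=9223372036854775807"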

        Wei Deng added a comment -

        Since we're proposing to run with a much larger Eden space and -XX:+CMSScavengeBeforeRemark, we might want to watch out for frequent occurrences of "GCLocker Initiated GC" in the gc.log. According to this discussion: http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2014-September/010650.html, there is a possibility for the scavenge to collide with the GCLocker and make the remark phase take longer than expected, and "GCLocker Initiated GC" can happen unnecessarily due to this outstanding JDK bug: https://bugs.openjdk.java.net/browse/JDK-8048556.

        Wei Deng added a comment -

        Can we also test increasing InitialCodeCacheSize and ReservedCodeCacheSize to 256MB? According to this JavaOne 2013 talk from a Twitter engineer (https://oracleus.activeevents.com/2013/connect/fileDownload/session/DF4EE14FC64B279BC4C9E699C0231622/CON4540_Keenan-JavaOne2013-CON4540-Keenan.pdf, page 47), the default settings are too low.
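        For reference, that would presumably look like this in cassandra-env.sh (256m per the suggestion above):

        JVM_OPTS="$JVM_OPTS -XX:InitialCodeCacheSize=256m"
        JVM_OPTS="$JVM_OPTS -XX:ReservedCodeCacheSize=256m"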

        Pierre Laporte added a comment -

        Matt Stump, by any chance have you collected Cassandra GC logs for various scenarios? That would be really valuable for finding the right values.

        I ran a test of the java-driver against a C* instance on a GCE n1-standard-1 server (1 vCPU, 3.75 GB RAM). The young generation size was 100 MB (80 MB for Eden, 10 MB for each survivor) and the old generation size was 2.4 GB.

        I had the following:

        • Average allocation rate: 352MB/s (outliers above 600MB/s)
        • 4.5 DefNew cycles per second
        • 1 CMS cycle every 10 minutes

        Therefore, during the test, Cassandra was promoting objects at a rate of 3.8 MB/s.

        I think the size of Eden could be determined mostly by the allocation rate and the DefNew/ParNew frequency we want to achieve. Here, for instance, I would rather have had a bigger young generation to have ~1 DefNew cycle/s.

        I did not enable -XX:+PrintTenuringDistribution so I do not know whether the objects were prematurely promoted. That would have given pointers on survivors sizing as well.

        Do you have any GC logs with that flag?
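        For reference, the logging flags in question would look something like this in cassandra-env.sh form (log path is illustrative):

        JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
        JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
        JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
        JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"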

        Jeremy Hanna added a comment -

        Have we made any progress towards determining whether these are reasonable new defaults? It sounds like there is good evidence suggesting they are good but that we are waiting on tests and perhaps some gc logs?

        Brandon Williams added a comment -

        The next step here is to have someone on Ryan McGuire's team do some comparison runs on cstar and proceed based on whatever empirical evidence that provides.

        Albert P Tobey added a comment -

        It appears that -XX:+UseGCTaskAffinity is a no-op in hotspot.
        https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications
        Anuj Wadehra added a comment - edited

        We have a write-heavy workload and used to face promotion failures/long GC pauses with Cassandra 2.0.x. I am not into the code yet, but I think that memtable and compaction related objects are mid-lived, and a write-heavy workload is not well suited to generational collection by default. So we tuned the JVM to make sure that as few objects as possible are promoted to old gen, and achieved great success with that:

        MAX_HEAP_SIZE="12G"
        HEAP_NEWSIZE="3G"
        -XX:SurvivorRatio=2
        -XX:MaxTenuringThreshold=20
        -XX:CMSInitiatingOccupancyFraction=70
        JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
        JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
        JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
        JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
        JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
        JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
        JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
        JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
        JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
        JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
        JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"

        We also think the default of total_memtable_space_in_mb = 1/4 of the heap is too much for write-heavy loads. By default, young gen is also 1/4 of the heap. We reduced the memtable space to 1000 MB in order to make sure that memtable related objects don't stay in memory for too long. Combining this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC was very consistent, and no full GC was observed.

        Environment: 3-node cluster, each node having 24 cores, 64 GB RAM, and SSDs in RAID 5.

        We are doing around 12k writes/sec across 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of the 3-node cluster. 2 CFs have wide rows with a max of around 100 MB of data per row.

        Jeremy Hanna added a comment -

        From conversations with Matt Stump and Albert P Tobey about these settings, the only settings that are generally applicable are:

        JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
        JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"

        All other settings need to be tuned one at a time using A/B testing, specific to the environment and use case.

        It was also noted that an 8G heap and a 2G newsize are a good general starting point.
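        In cassandra-env.sh terms that starting point would look something like this (values to be A/B tested per environment, not a blessed default):

        MAX_HEAP_SIZE="8G"
        HEAP_NEWSIZE="2G"
        JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
        JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"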

        Matt Stump added a comment -

        We should probably close this ticket. As Jeremy Hanna stated, very few CMS optimizations appear to be generally applicable. I think the existing defaults for the tenuring threshold, survivor space, and new gen size aren't good, but rather than tune CMS I would rather use G1 as part of CASSANDRA-7486 and allow it to self-tune. Getting CMS to deliver peak throughput while minimizing latency is unfortunately an application-specific and iterative process.

        Anuj Wadehra added a comment -

        As 2.0.15 is the production-ready version of Cassandra, can you please confirm whether the G1 collector has been thoroughly tested on 2.0.x? Or should production users on 2.0.x stick with CMS? 2.1 and 3.0 are still not production ready, and the default JVM settings for 2.0.x never worked for us.

        Matt Stump added a comment -

        2.1 is production ready. We've got a number of DataStax customers using G1 with both 2.0.X and 2.1.X. Some of the comments in CASSANDRA-7486 are a direct result of those experiences.

        Benedict added a comment -

        2.1.6, which is voting now, will be given the "production ready" label. There are still some problems with 2.1.5 that prevent us from giving it that label. My understanding is that the DSE release does not have these problems.

        Albert P Tobey added a comment - edited

        I did some testing on EC2 with Cassandra 2.0 and G1GC and found the following settings to work well. Make sure to comment out the -Xmn line as shown.

        
        MAX_HEAP_SIZE="16G"
        HEAP_NEWSIZE="2G" # placeholder, ignored

        # setting -Xmn breaks G1GC, don't do it
        #JVM_OPTS="$JVM_OPTS -Xmn${HEAP_NEWSIZE}"
        
        # G1GC support atobey@datastax.com 2015-04-03
        JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
        
        # Cassandra does not benefit from biased locking
        JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
        
        # lowering the pause target will lower throughput
        # 200ms is the default and lowest viable setting for G1GC
        # 1000ms seems to provide good balance of throughput and latency
        JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=1000"
        
        # auto-optimize thread local allocation block size
        # https://blogs.oracle.com/jonthecollector/entry/the_real_thi
        JVM_OPTS="$JVM_OPTS -XX:+UseTLAB -XX:+ResizeTLAB"
        
        

          People

          • Assignee:
            Unassigned
            Reporter:
            Matt Stump
          • Votes:
            12
            Watchers:
            56
