CASSANDRA-15400

Cassandra 3.0.18 went OOM several hours after joining a cluster

Details

    Description

      We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have twice hit an OOM with 3.0.18 on newly added nodes, in each case several hours after the node had successfully bootstrapped and joined an existing cluster.

      Running in AWS:

      • m5.2xlarge, EBS SSD (gp2)
      • Xms/Xmx12G, Xmn3G, CMS GC, OpenJDK8u222
      • 4 compaction threads, throttling set to 32 MB/s

      What we see is a steady increase in old gen usage over many hours.

      • The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00
      • It finished joining the cluster (UJ => UN) ~19 hours later, on Oct 31 ~07:00, and from then on also served client read requests

      Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage constantly increased.

      The old gen growth correlates with an increasing number of SSTables and pending compactions.

      This continued until the node hit the OOM sometime during the night of Nov 1. After a Cassandra restart (the metric gap in the chart above), the number of SSTables and pending compactions is still high, but we have not had memory trouble since then.

      The auto-generated heap dump supports this correlation: it shows, for example, ~5K BigTableReader instances with ~8.7 GByte of retained heap in total.

      Taking a closer look at a single object instance, each one appears to be ~2 MByte in size, with 2 pre-allocated byte buffers (highlighted in the screen above) of 1 MByte each.
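
As a rough cross-check of the figures above, here is a minimal back-of-the-envelope sketch in Python. It only restates the approximate numbers from this report; the conversion 1 GByte = 1024 MByte and the idea that old gen capacity is simply Xmx minus Xmn are assumptions made for illustration, and the variable names are hypothetical.

```python
# Figures quoted in the report (all approximate, as stated there).
reader_count = 5_000        # "~5K BigTableReader instances"
retained_total_gb = 8.7     # "~8.7 GByte retained heap in total"
heap_gb = 12                # Xms/Xmx12G
young_gen_gb = 3            # Xmn3G

# Assumption: 1 GByte = 1024 MByte.
avg_retained_mb = retained_total_gb * 1024 / reader_count

# Assumption: old gen capacity is the heap minus the fixed young gen size.
old_gen_gb = heap_gb - young_gen_gb
old_gen_share = retained_total_gb / old_gen_gb

print(f"average retained per reader: {avg_retained_mb:.2f} MByte")
print(f"old gen capacity: {old_gen_gb} GByte")
print(f"share of old gen retained by readers: {old_gen_share:.0%}")
```

Under these assumptions the average comes out at about 1.78 MByte per reader, which is in line with the ~2 MByte seen on a single instance, and the readers alone would account for roughly 97% of a 9 GByte old gen.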

      We have been running 2.1.18 for more than 3 years, and I can't remember ever hitting such an OOM while extending a cluster.

      While the MAT screens above are from our production cluster, we can partly reproduce this behavior in our loadtest environment (although it does not go all the way to an OOM there), so I might be able to share an hprof from this non-prod environment if needed.

      Thanks a lot.

      Attachments

        1. oldgen_increase_nov12.jpg (113 kB, Thomas Steinmaurer)
        2. image.png (102 kB, Alex Petrov)
        3. cassandra_sstables_pending_compactions.png (18 kB, Thomas Steinmaurer)
        4. cassandra_operationcount.png (13 kB, Thomas Steinmaurer)
        5. cassandra_jvm_metrics.png (39 kB, Thomas Steinmaurer)
        6. cassandra_hprof_statsmetadata.png (66 kB, Thomas Steinmaurer)
        7. cassandra_hprof_dominator_classes.png (27 kB, Thomas Steinmaurer)
        8. cassandra_hprof_bigtablereader_statsmetadata.png (96 kB, Thomas Steinmaurer)

          People

            bdeggleston Blake Eggleston
            tsteinmaurer Thomas Steinmaurer
            Blake Eggleston
            Blake Eggleston, Marcus Eriksson
            Votes: 0
            Watchers: 7
