CASSANDRA-15119

Repair fails randomly, causing nodes to restart

    Details

    • Platform:
      All
    • Impacts:
      None

      Description

      We have a cluster of 3 nodes (same DC) with ~8 GB on disk per node. One keyspace has two tables, which together hold about 20 million rows with around 20 columns each. Whenever we try to run a repair (with or without cassandra-reaper, on any setting), the repair causes certain nodes to fail and restart. Originally these nodes used the default heap size calculation on machines with 12 GB of RAM.
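
      For context, the default heap size mentioned above can be estimated with a short sketch. It assumes the stock cassandra-env.sh shipped with 3.x is unmodified and picks max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)); the function name is hypothetical, and the memory actually reported by the OS is usually a little below the nominal figure, so real values will be slightly lower.

```python
def default_max_heap_mb(system_memory_mb: int) -> int:
    """Approximate the default MAX_HEAP_SIZE (in MB) that the stock
    cassandra-env.sh is assumed to compute when no heap size is set."""
    # Half of system memory, capped at 1024 MB.
    half = min(system_memory_mb // 2, 1024)
    # A quarter of system memory, capped at 8192 MB.
    quarter = min(system_memory_mb // 4, 8192)
    # The script uses the larger of the two values.
    return max(half, quarter)


if __name__ == "__main__":
    # Nominal sizes for the original and the upscaled machines.
    for ram_gb in (12, 24):
        heap = default_max_heap_mb(ram_gb * 1024)
        print(f"{ram_gb} GB RAM: default max heap is about {heap} MB")
```

      Under these assumptions, the original 12 GB machines would have run with a heap of roughly 3 GB, well below the 12 GB configured after the upgrade.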

      We upscaled these to 24 GB of RAM with a 12 GB maximum heap (-Xmx), which seemed to make a difference but is still not quite enough. With JProfiler we can see that random nodes reach the -Xmx limit while streaming data, regardless of the size of the repair.

      I can't understand how such operations can cause servers to literally crash rather than just say "no, I can't do it". We've tried a lot of things, including setting up a fresh cluster, manually inserting all the data (with the correct replication factor), and then running repairs.

      Sometimes the repairs work (barely), sometimes they fail. I really don't understand why.

      We're running Cassandra 3.11.4.

      Could I receive some assistance in troubleshooting this?

            People

            • Assignee:
              Unassigned
            • Reporter:
              Brentc Brent