Apache Cassandra
CASSANDRA-12860

Nodetool repair is fragile: it cannot properly recover from a single node failure; all nodes have to be restarted in order to repair again


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Urgent
    • Resolution: Duplicate
    • Environment: CentOS 6.7, Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode), Cassandra 3.5.0, fresh install
    • Severity: Critical

    Description

      Summary of symptom:

      • The setup is a multi-region cluster in AWS (5 regions). Each region has at least 4 hosts, with RF equal to half the number of nodes in the region, using vnodes (256 tokens per node)
      • How to reproduce:
        • On node A, start this repair job (again we are running fresh 3.5.0):
          nohup sudo nodetool repair -j 2 -pr -full myks > /tmp/repair.log 2>&1 &
        • The job starts fine, reporting progress like:
          [2016-10-28 22:37:52,692] Starting repair command #1, repairing keyspace myks with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 2, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 256)
          [2016-10-28 22:38:35,099] Repair session 36f13450-9d5f-11e6-8bf7-a9f47ff986a9 for range [(4029874034937227774,4033949979656106020]] finished (progress: 1%)
          [2016-10-28 22:38:38,769] Repair session 36f30910-9d5f-11e6-8bf7-a9f47ff986a9 for range [(-2395606719402271267,-2394525508513518837]] finished (progress: 1%)
          [2016-10-28 22:38:48,521] Repair session 36f3f370-9d5f-11e6-8bf7-a9f47ff986a9 for range [(-5223108861718702793,-5221117649630514419]] finished (progress: 2%)
          
        • Then manually shut down another node (node B) in the same region (haven't tried with other regions yet, but from past experience we expect the same behavior); node B's down state can be confirmed with the status check sketched after this description
        • Shortly after that, this message appears in the job log (as well as in system.log) on node A:
          [2016-10-28 22:41:46,268] Repair session 37088ce1-9d5f-11e6-8bf7-a9f47ff986a9 for range [(-928974038666914990,-927967994563261540]] failed with error Endpoint /node_B_ip died (progress: 51%)
          
        • From this point on, the repair job seems to hang:
          • no further messages in the job log
          • no related messages in system.log
          • CPU stays low (low single-digit percentage of one core)
        • After an hour, manually kill the repair job (the process is found via "ps -eaf | grep repair"; see the shell sketch after this description)
        • Restart C* on node A
          • Verified the system is up and there are no error messages in system.log
          • Also verified that there are no error messages on node B
        • After node A settles down (i.e. no new messages in system.log), restart the same repair job:
          nohup sudo nodetool repair -j 2 -pr -full myks > /tmp/repair.log 2>&1 &
        • The job fails pretty quickly, reporting errors from two more nodes, B and K:
           <production>[ywu@cass-tm-1b-012.apse1.mashery.com ~]$ tail -f /tmp/repair.log 
          [2016-10-28 22:49:52,965] Starting repair command #1, repairing keyspace myks with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 2, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 256)
          [2016-10-28 22:50:15,839] Repair session e4180720-9d60-11e6-b2f9-cb9524b3c536 for range [(4029874034937227774,4033949979656106020]] failed with error [repair #e4180720-9d60-11e6-b2f9-cb9524b3c536 on myks/rtable, [(4029874034937227774,4033949979656106020]]] Validation failed in /node_K_ip (progress: 1%)
          [2016-10-28 22:50:17,158] Repair session e419dbe0-9d60-11e6-b2f9-cb9524b3c536 for range [(-2395606719402271267,-2394525508513518837]] failed with error [repair #e419dbe0-9d60-11e6-b2f9-cb9524b3c536 on myks/rtable, [(-2395606719402271267,-2394525508513518837]]] Validation failed in /node_B_ip (progress: 1%)
          [2016-10-28 22:50:18,256] Repair session e41b1460-9d60-11e6-b2f9-cb9524b3c536 for range [(-5223108861718702793,-5221117649630514419]] failed with error [repair #e41b1460-9d60-11e6-b2f9-cb9524b3c536 on myks/rtable, [(-5223108861718702793,-5221117649630514419]]] Validation failed in /node_B_ip (progress: 2%)
          
        • On those nodes (B and K), similar errors appear (one way to check them for lingering validation work is sketched after this description):
          ERROR [ValidationExecutor:5] 2016-10-28 22:58:45,307 CompactionManager.java:1320 - Cannot start multiple repair sessions over the same sstables
          ERROR [ValidationExecutor:5] 2016-10-28 22:58:45,307 Validator.java:261 - Failed creating a merkle tree for [repair #14378ec0-9d62-11e6-ab75-cd4d64a01b02 on myks/atable, [(4029874034937227774,4033949979656106020]]], /52.220.127.190 (see log for details)
          INFO  [AntiEntropyStage:1] 2016-10-28 22:58:45,307 Validator.java:274 - [repair #14378ec0-9d62-11e6-ab75-cd4d64a01b02] Sending completed merkle tree to /52.220.127.190 for myks.xtable
          ERROR [ValidationExecutor:5] 2016-10-28 22:58:45,308 CassandraDaemon.java:195 - Exception in thread Thread[ValidationExecutor:5,1,main]
          java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
                  at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1321) ~[apache-cassandra-3.5.0.jar:3.5.0]
                  at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1211) ~[apache-cassandra-3.5.0.jar:3.5.0]
                  at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:81) ~[apache-cassandra-3.5.0.jar:3.5.0]
                  at org.apache.cassandra.db.compaction.CompactionManager$11.call(CompactionManager.java:841) ~[apache-cassandra-3.5.0.jar:3.5.0]
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_102]
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_102]
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_102]
                  at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]
          INFO  [AntiEntropyStage:1] 2016-10-28 22:58:45,318 Validator.java:274 - [repair #14378ec0-9d62-11e6-ab75-cd4d64a01b02] Sending completed merkle tree to /52.220.127.190 for myks.ytable
          
        • At this point we are back where we started: kill the repair job on node A, then restart C* on BOTH nodes A and K, but the same exceptions keep appearing, except that sometimes they show up on other servers all over the ring.
      • Business impact: I am in the process of launching a Cassandra-based production system, but I have to hold back now because of how fragile repair is. And I am told by many sources that I have to rely on periodic repair jobs to fix data inconsistencies.
      • The only workaround was a rolling restart of the Cassandra server on ALL nodes in the entire cluster (sketched below)
        • After that, the repair job can proceed without any error
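
      For the step where node B is shut down: one quick way to confirm from node A that the cluster now sees it as down (a sketch; "node_B_ip" is the same placeholder used in the logs above):

        # Run on node A after stopping Cassandra on node B.
        # The node should move from UN (Up/Normal) to DN (Down/Normal).
        nodetool status myks | grep node_B_ip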
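
      For the step where the hung repair job is killed manually, the commands amount to something like this (a sketch, assuming the job was launched via "nodetool repair" as shown above and nothing else on the host matches that pattern):

        # Locate the hung nodetool repair process on node A.
        ps -eaf | grep "nodetool repair" | grep -v grep

        # Kill it by matching the full command line.
        sudo pkill -f "nodetool repair"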
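
      For checking nodes B and K before retrying the repair, one hedged way to see whether validation work from the earlier failed session is still running or queued (both commands exist in 3.x; treating empty output as "idle" is an assumption):

        # Any validation compactions still listed from the previous repair?
        nodetool compactionstats

        # Are the validation / anti-entropy thread pools still busy or backed up?
        nodetool tpstats | grep -E "ValidationExecutor|AntiEntropyStage"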
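
      For reference, the rolling-restart workaround described above amounts to something like the following. This is only a sketch: the node IP list, passwordless ssh/sudo, and the "service cassandra" name are assumptions about this environment; "nodetool drain" is used so each node flushes and stops serving requests cleanly before its restart.

        # Hypothetical list of node IPs; replace with the cluster's actual addresses.
        NODES="10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13"

        for ip in $NODES; do
            echo "Restarting Cassandra on $ip"
            # Flush and shut down cleanly, then restart the service.
            ssh "$ip" "nodetool drain && sudo service cassandra restart"
            # Wait until this node is reported Up/Normal (UN) again before moving on.
            until nodetool status | grep "$ip" | grep -q "^UN"; do
                sleep 10
            done
        done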

    People

      Assignee: Unassigned
      Reporter: Bing Wu (bing1wu)