  1. Cassandra
  2. CASSANDRA-15109

nodetool repair failing with "Validation failed in /10.222.5.44"

Details

    • Bug
    • Status: Triage Needed
    • Normal
    • Resolution: Unresolved
    • None
    • Tool/nodetool
    • None
    • All
    • None

    Description

      Cassandra Version: 2.2.13

      Command

       

      nodetool -h 127.0.0.1 -p 7199 repair -pr -full

       

      Sample Output

       

      Repair session c230e910-6d74-11e9-8952-a70261a0ced8 for range (4812194106185100517,5213210281700525452] failed with error [repair #c230e910-6d74-11e9-8952-a70261a0ced8 on ks/table, (4812194106185100517,5213210281700525452]] Validation failed in /10.223.5.44 (progress: 100%)
      

       

      On the mentioned node we have the following info logged...

       

      May  3 13:26:13 XXXXXXXX cassandra: ERROR 11:26:13 Failed creating a merkle tree for [repair #8a6859c0-6d95-11e9-b769-5964d82f38b1 on ks/table, (4812194106185100517,5213210281700525452]], /X.X.5.42 (see log for details)

       

      These are always (as seen so far) preceded by...

       

      Apr 29 00:45:04 XXXXXXXX cassandra: INFO 22:45:04 InetAddress /X.X.5.42 is now DOWN
      Apr 29 00:45:09 XXXXXXXX cassandra: INFO 22:45:09 Handshaking version with /10.223.5.42
      Apr 29 00:45:09 XXXXXXXX cassandra: INFO 22:45:09 InetAddress /X.X.5.42 is now UP

       

      and followed by a Java stack trace...

       

      Apr 29 00:45:10 XXXXXXXX cassandra: ERROR 22:45:10 Exception in thread Thread[ValidationExecutor:43,1,main]
      Apr 29 00:45:10 XXXXXXXX cassandra: java.lang.RuntimeException: Parent repair session with id = 8f9fe6c0-6a06-11e9-bd05-21e986c06e90 has failed.
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:398) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1206) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1131) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:76) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:736) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: INFO 22:45:10 Writing Memtable-compactions_in_progress@2106381056(0.156KiB serialized bytes, 9 ops, 0%/0% of on/off-heap limit)
      Apr 29 00:45:10 XXXXXXXX cassandra: INFO 22:45:10 Handshaking version with /10.223.5.42
      Apr 29 00:45:10 XXXXXXXX cassandra: INFO 22:45:10 Writing Memtable-compactions_in_progress@134296463(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
      Apr 29 00:45:10 XXXXXXXX cassandra: ERROR 22:45:10 Got error, removing parent repair session
      Apr 29 00:45:10 XXXXXXXX cassandra: ERROR 22:45:10 Exception in thread Thread[AntiEntropyStage:1,5,main]
      Apr 29 00:45:10 XXXXXXXX cassandra: java.lang.RuntimeException: java.lang.RuntimeException: Parent repair session with id = 8f9fe6c0-6a06-11e9-bd05-21e986c06e90 has failed.
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:183) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]
      Apr 29 00:45:10 XXXXXXXX cassandra: Caused by: java.lang.RuntimeException: Parent repair session with id = 8f9fe6c0-6a06-11e9-bd05-21e986c06e90 has failed.
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:398) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:432) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:155) ~[apache-cassandra-2.2.13.jar:2.2.13]
      Apr 29 00:45:10 XXXXXXXX cassandra: ... 6 common frames omitted
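The sequence above (a peer marked DOWN, then UP, shortly before the "Parent repair session ... has failed" exception) could be checked across a whole log file with a small script. The following is only a minimal sketch in Python, assuming syslog-style lines formatted like the excerpts in this report. The function names, the 60-second window, and the year used to parse the timestamps are illustrative assumptions, not anything provided by Cassandra.

```python
import re
from datetime import datetime, timedelta

# Syslog prefix as seen in the excerpts: "Apr 29 00:45:04 host cassandra: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+cassandra:\s+(.*)$")
DOWN_RE = re.compile(r"InetAddress (\S+) is now DOWN")
FAIL_RE = re.compile(r"Parent repair session with id = (\S+) has failed")


def parse_ts(stamp, year=2019):
    # Syslog timestamps carry no year; the default here is an assumption.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def failures_after_flap(lines, window=timedelta(seconds=60)):
    """Return (session_id, peer, seconds_since_down) for each failed parent
    repair session logged within `window` after a peer was marked DOWN."""
    last_down = None  # (timestamp, peer) of the most recent DOWN event
    seen = set()      # the same exception is logged several times per session
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, msg = parse_ts(m.group(1)), m.group(2)
        down = DOWN_RE.search(msg)
        if down:
            last_down = (ts, down.group(1))
            continue
        fail = FAIL_RE.search(msg)
        if fail and last_down and ts - last_down[0] <= window:
            session = fail.group(1)
            if session not in seen:
                seen.add(session)
                delta = (ts - last_down[0]).total_seconds()
                hits.append((session, last_down[1], delta))
    return hits
```

Calling `failures_after_flap(open(path))` on a node's log (where `path` is whatever file syslog writes Cassandra output to) would list each failed session together with the peer that flapped just before it. An empty result would suggest the DOWN/UP events are not as closely tied to the failures as they appear.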

       

      I've tried a few combinations of options with the nodetool repair command. Here are the results...

       

      parallelism: parallel, primary range: true, incremental: false - NOK
      parallelism: parallel, primary range: false, incremental: false - NOK
      parallelism: parallel, primary range: false, incremental: false - NOK
      parallelism: sequential, primary range: false, incremental: false - NOK (although with a different error: failed with error Could not create snapshot at /X.X.5.43 (progress: 60%))
      parallelism: parallel, primary range: false, incremental: true - OK
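
To make a matrix like the one above reproducible, the option combinations can be turned into explicit command lines. The following is a hedged sketch in Python that only builds the commands and does not run them. It assumes the 2.2-era `nodetool repair` flags `-seq` (sequential), `-pr` (primary range) and `-full` (non-incremental), with parallel and incremental as the defaults; verify this against `nodetool help repair` on the cluster. The function names are hypothetical.

```python
import itertools


def repair_command(sequential, primary_range, incremental,
                   host="127.0.0.1", port=7199):
    """Build the argument list for one combination of repair options.

    Assumption: parallel and incremental are the defaults, so flags are
    only added for sequential, primary-range and full repairs.
    """
    cmd = ["nodetool", "-h", host, "-p", str(port), "repair"]
    if sequential:
        cmd.append("-seq")
    if primary_range:
        cmd.append("-pr")
    if not incremental:
        cmd.append("-full")
    return cmd


def all_combinations():
    """Yield (label, command string) for every combination of the three options."""
    for seq, pr, inc in itertools.product([False, True], repeat=3):
        label = (
            f"parallelism: {'sequential' if seq else 'parallel'}, "
            f"primary range: {str(pr).lower()}, "
            f"incremental: {str(inc).lower()}"
        )
        yield label, " ".join(repair_command(seq, pr, inc))


if __name__ == "__main__":
    for label, command in all_combinations():
        print(f"{label}\n  {command}")
```

Under these assumptions, `repair_command(False, True, False)` yields the same arguments as the command shown at the top of this report. Recording the exact command next to each OK/NOK result makes it easier to spot combinations that were tested twice or not at all.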
      

      This only started happening relatively recently. There have been no changes to our system, major or minor, that we think would cause this. It is happening on every node in one DC and on a few nodes in the second. The "Failed creating a merkle tree" error is present on every node, but most of the nodes in the second DC seem to complete their repair.

       

       

          People

            Assignee: Unassigned
            Reporter: Rhys Ulerich (rhys)