CASSANDRA-16585

Periodic failures in *RepairCoordinator*Test caused by race condition with nodetool repair


    Details

      Description

      Periodic failures in *RepairCoordinator*Test cause errors such as

      FullRepairCoordinatorNeighbourDownTest#validationParticipentCrashesAndComesBack[DATACENTER_AWARE/true]

      nodetool command [repair, distributed_test_keyspace, validationparticipentcrashesandcomesback_full_datacenter_aware_true, --dc-parallel, --full] Error message 'Some repair failed' does not contain any of [/127.0.0.2:7012 died]
      stdout:
      [2021-04-07 22:45:24,887] Starting repair command #10 (f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace distributed_test_keyspace with repair options (parallelism: dc_parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [validationparticipentcrashesandcomesback_full_datacenter_aware_true], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, optimise streams: false, ignore unreplicated keyspaces: false)
      [2021-04-07 22:45:32,864] Repair command #10 failed with error Repair session f1342ba0-97f2-11eb-9316-794aa6ab8411 for range [(-1,9223372036854775805], (9223372036854775805,-1]] failed with error Endpoint /127.0.0.2:7012 died
      [2021-04-07 22:45:32,887] After waiting for poll interval of 1 seconds queried for parent session status and discovered repair failed.
      [2021-04-07 22:45:32,887] Repair command #10 finished with error
      [2021-04-07 22:45:32,887] Some repair failed
      [2021-04-07 22:45:32,888] Repair command #10 finished with error
      
      stderr:
      error: Some repair failed
      -- StackTrace --
      java.io.IOException: Some repair failed
      at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
      at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
      at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
      at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
      at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
      at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
      at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
      at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
      at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.base/java.lang.Thread.run(Thread.java:834)
      
      
      Notifications:
      Notification{type=START, src=repair:10, message=Starting repair command #10 (f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace distributed_test_keyspace with repair options (parallelism: dc_parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [validationparticipentcrashesandcomesback_full_datacenter_aware_true], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, optimise streams: false, ignore unreplicated keyspaces: false)}
      Notification{type=ERROR, src=repair:10, message=Repair command #10 failed with error Repair session f1342ba0-97f2-11eb-9316-794aa6ab8411 for range [(-1,9223372036854775805], (9223372036854775805,-1]] failed with error Endpoint /127.0.0.2:7012 died}
      Notification{type=COMPLETE, src=repair:10, message=Repair command #10 finished with error}
      Error:
      java.io.IOException: Some repair failed
      at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
      at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
      at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
      at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
      at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
      at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
      at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
      at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
      at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.base/java.lang.Thread.run(Thread.java:834)
      

      There seems to be a race condition in nodetool repair: we query the parent session's error state before we have processed the ERROR notification, so we throw the generic error ('Some repair failed') rather than the specific one ('Endpoint /127.0.0.2:7012 died').
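
      One way to picture the race and a possible mitigation is sketched below. This is not the code in RepairRunner and not necessarily the fix that was committed. The class RepairErrorResolver, its method names, and the grace period are all hypothetical, introduced only for illustration. The idea is that the polling path, on discovering a failed status, gives an in-flight ERROR notification a bounded amount of time to arrive and prefers its message over the generic one.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper for illustration; not part of Cassandra's RepairRunner. */
class RepairErrorResolver
{
    static final String GENERIC = "Some repair failed";

    // Message from the first ERROR notification, or null if none has arrived yet.
    private final AtomicReference<String> specificError = new AtomicReference<>();
    private final CountDownLatch errorSeen = new CountDownLatch(1);

    /** Called from the notification thread when an ERROR notification arrives. */
    void onErrorNotification(String message)
    {
        // Keep only the first error; later notifications do not overwrite it.
        if (specificError.compareAndSet(null, message))
            errorSeen.countDown();
    }

    /** Called from the polling thread after the status query reports a failure. */
    IOException resolve(long graceMillis) throws InterruptedException
    {
        // Give a notification that is still in flight a bounded chance to land.
        errorSeen.await(graceMillis, TimeUnit.MILLISECONDS);
        String specific = specificError.get();
        // Fall back to the generic message only if no specific error was recorded.
        return new IOException(specific != null ? specific : GENERIC);
    }
}
```

      With this shape, a test asserting on "/127.0.0.2:7012 died" would pass whenever the notification arrives within the grace period, and the generic message would remain only as a last resort.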

              People

              • Assignee: David Capwell (dcapwell)
                Reporter: David Capwell (dcapwell)
                Authors: David Capwell
                Reviewers: Marcus Eriksson
              • Votes: 0
                Watchers: 3

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m