CASSANDRA-5424

nodetool repair -pr on all nodes won't repair the full range when a Keyspace isn't in all DCs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.2.5
    • Component/s: None
    • Labels: None

      Description

      nodetool repair -pr on all nodes won't repair the full range when a Keyspace isn't in all DCs

      Commands follow, but the TL;DR of it: range (127605887595351923798765477786913079296,0] doesn't get repaired between the .38 node and the .236 node until I run a repair, without -pr, on .38

      It seems like the primary range calculation doesn't take the schema into account, but deciding which nodes to ask for merkle trees does.
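      To make the suspected mismatch concrete, here is a small self-contained sketch (plain Java, illustration only; the class and output format are made up, this is not Cassandra code) of what a token-only primary range calculation produces for this ring:

      import java.math.BigInteger;
      import java.util.Map;
      import java.util.TreeMap;

      public class PrimaryRangeMismatch
      {
          public static void main(String[] args)
          {
              // The ring from this ticket: token -> node (DC noted alongside).
              TreeMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();
              ring.put(BigInteger.ZERO, "10.72.111.225 (Cassandra DC)");
              ring.put(new BigInteger("42535295865117307932921825928971026432"), "10.2.29.38 (Analytics DC)");
              ring.put(new BigInteger("127605887595351923798765477786913079296"), "10.46.113.236 (Analytics DC)");

              // Token-only "primary range": (predecessor token, own token], schema ignored.
              for (Map.Entry<BigInteger, String> e : ring.entrySet())
              {
                  BigInteger prev = ring.lowerKey(e.getKey()) != null ? ring.lowerKey(e.getKey()) : ring.lastKey();
                  System.out.println(e.getValue() + " -> (" + prev + ", " + e.getKey() + "]");
              }
              // The wrapping range (127605887595351923798765477786913079296,0] gets assigned to
              // 10.72.111.225, which holds no Keyspace1 data, so repair -pr never covers it for Keyspace1.
          }
      }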

      Address         DC          Rack        Status State   Load            Owns                Token                                       
                                                                                                 127605887595351923798765477786913079296     
      10.72.111.225   Cassandra   rack1       Up     Normal  455.87 KB       25.00%              0                                           
      10.2.29.38      Analytics   rack1       Up     Normal  40.74 MB        25.00%              42535295865117307932921825928971026432      
      10.46.113.236   Analytics   rack1       Up     Normal  20.65 MB        50.00%              127605887595351923798765477786913079296     
      
      create keyspace Keyspace1
        with placement_strategy = 'NetworkTopologyStrategy'
        and strategy_options = {Analytics : 2}
        and durable_writes = true;
      
      -------
      # nodetool -h 10.2.29.38 repair -pr Keyspace1 Standard1
      [2013-04-03 15:46:58,000] Starting repair command #1, repairing 1 ranges for keyspace Keyspace1
      [2013-04-03 15:47:00,881] Repair session b79b4850-9c75-11e2-0000-8b5bf6ebea9e for range (0,42535295865117307932921825928971026432] finished
      [2013-04-03 15:47:00,881] Repair command #1 finished
      
      root@ip-10-2-29-38:/home/ubuntu# grep b79b4850-9c75-11e2-0000-8b5bf6ebea9e /var/log/cassandra/system.log
       INFO [AntiEntropySessions:1] 2013-04-03 15:46:58,009 AntiEntropyService.java (line 676) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] new session: will sync a1/10.2.29.38, /10.46.113.236 on range (0,42535295865117307932921825928971026432] for Keyspace1.[Standard1]
       INFO [AntiEntropySessions:1] 2013-04-03 15:46:58,015 AntiEntropyService.java (line 881) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] requesting merkle trees for Standard1 (to [/10.46.113.236, a1/10.2.29.38])
       INFO [AntiEntropyStage:1] 2013-04-03 15:47:00,202 AntiEntropyService.java (line 211) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] Received merkle tree for Standard1 from /10.46.113.236
       INFO [AntiEntropyStage:1] 2013-04-03 15:47:00,697 AntiEntropyService.java (line 211) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] Received merkle tree for Standard1 from a1/10.2.29.38
       INFO [AntiEntropyStage:1] 2013-04-03 15:47:00,879 AntiEntropyService.java (line 1015) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] Endpoints /10.46.113.236 and a1/10.2.29.38 are consistent for Standard1
       INFO [AntiEntropyStage:1] 2013-04-03 15:47:00,880 AntiEntropyService.java (line 788) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] Standard1 is fully synced
       INFO [AntiEntropySessions:1] 2013-04-03 15:47:00,880 AntiEntropyService.java (line 722) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] session completed successfully
      
      root@ip-10-46-113-236:/home/ubuntu# grep b79b4850-9c75-11e2-0000-8b5bf6ebea9e /var/log/cassandra/system.log
       INFO [AntiEntropyStage:1] 2013-04-03 15:46:59,944 AntiEntropyService.java (line 244) [repair #b79b4850-9c75-11e2-0000-8b5bf6ebea9e] Sending completed merkle tree to /10.2.29.38 for (Keyspace1,Standard1)
      
      root@ip-10-72-111-225:/home/ubuntu# grep b79b4850-9c75-11e2-0000-8b5bf6ebea9e /var/log/cassandra/system.log
      root@ip-10-72-111-225:/home/ubuntu# 
      
      -------
      # nodetool -h 10.46.113.236  repair -pr Keyspace1 Standard1
      [2013-04-03 15:48:00,274] Starting repair command #1, repairing 1 ranges for keyspace Keyspace1
      [2013-04-03 15:48:02,032] Repair session dcb91540-9c75-11e2-0000-a839ee2ccbef for range (42535295865117307932921825928971026432,127605887595351923798765477786913079296] finished
      [2013-04-03 15:48:02,033] Repair command #1 finished
      
      root@ip-10-46-113-236:/home/ubuntu# grep dcb91540-9c75-11e2-0000-a839ee2ccbef /var/log/cassandra/system.log
       INFO [AntiEntropySessions:5] 2013-04-03 15:48:00,280 AntiEntropyService.java (line 676) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] new session: will sync a0/10.46.113.236, /10.2.29.38 on range (42535295865117307932921825928971026432,127605887595351923798765477786913079296] for Keyspace1.[Standard1]
       INFO [AntiEntropySessions:5] 2013-04-03 15:48:00,285 AntiEntropyService.java (line 881) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] requesting merkle trees for Standard1 (to [/10.2.29.38, a0/10.46.113.236])
       INFO [AntiEntropyStage:1] 2013-04-03 15:48:01,710 AntiEntropyService.java (line 211) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] Received merkle tree for Standard1 from a0/10.46.113.236
       INFO [AntiEntropyStage:1] 2013-04-03 15:48:01,943 AntiEntropyService.java (line 211) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] Received merkle tree for Standard1 from /10.2.29.38
       INFO [AntiEntropyStage:1] 2013-04-03 15:48:02,031 AntiEntropyService.java (line 1015) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] Endpoints a0/10.46.113.236 and /10.2.29.38 are consistent for Standard1
       INFO [AntiEntropyStage:1] 2013-04-03 15:48:02,032 AntiEntropyService.java (line 788) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] Standard1 is fully synced
       INFO [AntiEntropySessions:5] 2013-04-03 15:48:02,032 AntiEntropyService.java (line 722) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] session completed successfully
      
      root@ip-10-2-29-38:/home/ubuntu# grep dcb91540-9c75-11e2-0000-a839ee2ccbef /var/log/cassandra/system.log
       INFO [AntiEntropyStage:1] 2013-04-03 15:48:01,898 AntiEntropyService.java (line 244) [repair #dcb91540-9c75-11e2-0000-a839ee2ccbef] Sending completed merkle tree to /10.46.113.236 for (Keyspace1,Standard1)
      
      root@ip-10-72-111-225:/home/ubuntu# grep dcb91540-9c75-11e2-0000-a839ee2ccbef /var/log/cassandra/system.log
      root@ip-10-72-111-225:/home/ubuntu# 
      
      -------
      # nodetool -h 10.72.111.225  repair -pr Keyspace1 Standard1
      [2013-04-03 15:48:30,417] Starting repair command #1, repairing 1 ranges for keyspace Keyspace1
      [2013-04-03 15:48:30,428] Repair session eeb12670-9c75-11e2-0000-316d6fba2dbf for range (127605887595351923798765477786913079296,0] finished
      [2013-04-03 15:48:30,428] Repair command #1 finished
      
      root@ip-10-72-111-225:/home/ubuntu# grep eeb12670-9c75-11e2-0000-316d6fba2dbf /var/log/cassandra/system.log
       INFO [AntiEntropySessions:1] 2013-04-03 15:48:30,427 AntiEntropyService.java (line 676) [repair #eeb12670-9c75-11e2-0000-316d6fba2dbf] new session: will sync /10.72.111.225 on range (127605887595351923798765477786913079296,0] for Keyspace1.[Standard1]
       INFO [AntiEntropySessions:1] 2013-04-03 15:48:30,428 AntiEntropyService.java (line 681) [repair #eeb12670-9c75-11e2-0000-316d6fba2dbf] No neighbors to repair with on range (127605887595351923798765477786913079296,0]: session completed
      
      root@ip-10-46-113-236:/home/ubuntu# grep eeb12670-9c75-11e2-0000-316d6fba2dbf /var/log/cassandra/system.log
      root@ip-10-46-113-236:/home/ubuntu# 
      
      root@ip-10-2-29-38:/home/ubuntu# grep eeb12670-9c75-11e2-0000-316d6fba2dbf /var/log/cassandra/system.log
      root@ip-10-2-29-38:/home/ubuntu# 
      
      ---
      root@ip-10-2-29-38:/home/ubuntu# nodetool -h 10.2.29.38 repair Keyspace1 Standard1
      [2013-04-03 16:13:28,674] Starting repair command #2, repairing 3 ranges for keyspace Keyspace1
      [2013-04-03 16:13:31,786] Repair session 6bb81c20-9c79-11e2-0000-8b5bf6ebea9e for range (42535295865117307932921825928971026432,127605887595351923798765477786913079296] finished
      [2013-04-03 16:13:31,786] Repair session 6cb05ed0-9c79-11e2-0000-8b5bf6ebea9e for range (0,42535295865117307932921825928971026432] finished
      [2013-04-03 16:13:31,806] Repair session 6d24a470-9c79-11e2-0000-8b5bf6ebea9e for range (127605887595351923798765477786913079296,0] finished
      [2013-04-03 16:13:31,807] Repair command #2 finished
      
      root@ip-10-2-29-38:/home/ubuntu# grep 6d24a470-9c79-11e2-0000-8b5bf6ebea9e /var/log/cassandra/system.log
       INFO [AntiEntropySessions:7] 2013-04-03 16:13:31,065 AntiEntropyService.java (line 676) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] new session: will sync a1/10.2.29.38, /10.46.113.236 on range (127605887595351923798765477786913079296,0] for Keyspace1.[Standard1]
       INFO [AntiEntropySessions:7] 2013-04-03 16:13:31,065 AntiEntropyService.java (line 881) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] requesting merkle trees for Standard1 (to [/10.46.113.236, a1/10.2.29.38])
       INFO [AntiEntropyStage:1] 2013-04-03 16:13:31,751 AntiEntropyService.java (line 211) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] Received merkle tree for Standard1 from /10.46.113.236
       INFO [AntiEntropyStage:1] 2013-04-03 16:13:31,785 AntiEntropyService.java (line 211) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] Received merkle tree for Standard1 from a1/10.2.29.38
       INFO [AntiEntropyStage:1] 2013-04-03 16:13:31,805 AntiEntropyService.java (line 1015) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] Endpoints /10.46.113.236 and a1/10.2.29.38 are consistent for Standard1
       INFO [AntiEntropyStage:1] 2013-04-03 16:13:31,806 AntiEntropyService.java (line 788) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] Standard1 is fully synced
       INFO [AntiEntropySessions:7] 2013-04-03 16:13:31,806 AntiEntropyService.java (line 722) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] session completed successfully
      
      root@ip-10-46-113-236:/home/ubuntu# grep 6d24a470-9c79-11e2-0000-8b5bf6ebea9e /var/log/cassandra/system.log 
       INFO [AntiEntropyStage:1] 2013-04-03 16:13:31,665 AntiEntropyService.java (line 244) [repair #6d24a470-9c79-11e2-0000-8b5bf6ebea9e] Sending completed merkle tree to /10.2.29.38 for (Keyspace1,Standard1)
      
      Attachments

      1. 5424-1.1.txt
        8 kB
        Yuki Morishita
      2. 5424-v2-1.2.txt
        6 kB
        Yuki Morishita
      3. 5424-v3-1.2.txt
        19 kB
        Yuki Morishita


          Activity

          jjordan Jeremiah Jordan added a comment -

          Tested back on 1.1.7 (before some recent repair changes) and it has the same issue.

          yukim Yuki Morishita added a comment -

           CASSANDRA-3912 changed the behavior of repair so that it is not performed when the given range is not part of the local node's ranges for a keyspace that has no replica there.
           For the case above, StorageService#getLocalRanges would return null for /10.72.111.225 and the range (127605887595351923798765477786913079296,0].

           Repair always sends a merkle tree request to the local node and synchronizes with the others, so the desired behavior would be to send merkle tree requests only to the nodes that hold a replica and let them synchronize.

          yukim Yuki Morishita added a comment -

           Patch attached against 1.1.

           It is basically a rewrite of AntiEntropyService.getNeighbors, but I moved that static method to StorageService and renamed it getReplicaNodes because I felt that is a more suitable place. The method returns the addresses of the replica nodes for the given keyspace and range. Previously the method did not return the address of the local node; the new version does, but only when the local node holds a replica.

           So for the case above, /10.72.111.225 sends tree requests only to the other nodes in the Analytics DC for the range it holds, and if there is a difference, it lets those nodes repair the data between themselves.
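           For illustration, a minimal sketch of the shape such a method could take (hypothetical code, not the attached 5424-1.1.txt patch; method and class names are approximations):

           // Sketch only: replica nodes for the given keyspace and range. The local address is
           // present only when the replication strategy says the local node holds a replica.
           public static Set<InetAddress> getReplicaNodes(String keyspace, Range<Token> range)
           {
               AbstractReplicationStrategy strategy = Table.open(keyspace).getReplicationStrategy();
               Set<InetAddress> replicas = new HashSet<InetAddress>(strategy.getNaturalEndpoints(range.right));
               // Unlike the old AntiEntropyService.getNeighbors, the local address is not stripped here;
               // it is simply absent when the local node is not a replica for this keyspace/range.
               return replicas;
           }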

          jbellis Jonathan Ellis added a comment -

          As you know, I'm pretty leery of making anything but the most superficial changes to 1.1.x at this point.

          Am I correct that a workaround would be, "only run repair against a node that is an owner of the given range?"

          jjordan Jeremiah Jordan added a comment -

           The workaround is to always run repair without -pr.

          jbellis Jonathan Ellis added a comment -

          Thinking about it, -pr really should NOT affect ranges that aren't replicated to the node in question. That's the whole point of that option!

          It looks to me like the real bug here is that repair is not NTS-aware: the "primary range" for .38 for Keyspace1 should be (127605887595351923798765477786913079296, 42535295865117307932921825928971026432], not (0, 42535295865117307932921825928971026432].

          yukim Yuki Morishita added a comment -

           OK, this time I created a patch against 1.2.

           We've been calculating the primary range purely from the tokens of the node. The patch changes this to use the replication strategy's calculateNaturalEndpoints, treating the first endpoint returned by that method as the owner of "the primary range". In order to do this in NTS, though, I had to tweak it a little (use a Set instead of a List internally).

           This way, the primary ranges for .38 for Keyspace1 above are (127...296, 0] and (0, 425...32]. For .225, it returns an empty set of ranges (by the way, I also had to fix repair for the empty-range case).
           When using vnodes, the ranges are not guaranteed to be consecutive, so I decided to return them as two separate ranges.
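           Roughly, the idea is the following (a sketch under the assumptions described in this comment, not the attached patch; names are approximations): for every token range in the ring, ask the strategy for its natural endpoints and attribute the range to whichever endpoint comes back first.

           // Sketch: primary ranges for an endpoint, derived from the replication strategy
           // rather than from raw token order.
           public Collection<Range<Token>> getPrimaryRangesForEndpoint(String keyspace, InetAddress endpoint)
           {
               AbstractReplicationStrategy strategy = Table.open(keyspace).getReplicationStrategy();
               Collection<Range<Token>> primaryRanges = new HashSet<Range<Token>>();
               TokenMetadata metadata = StorageService.instance.getTokenMetadata().cloneOnlyTokenMap();
               for (Token token : metadata.sortedTokens())
               {
                   List<InetAddress> endpoints = strategy.calculateNaturalEndpoints(token, metadata);
                   // The endpoint returned first by the strategy is treated as the "primary" owner.
                   if (!endpoints.isEmpty() && endpoints.get(0).equals(endpoint))
                       primaryRanges.add(new Range<Token>(metadata.getPredecessor(token), token));
               }
               // May be empty (e.g. .225 for Keyspace1) or contain several non-consecutive ranges with vnodes.
               return primaryRanges;
           }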

          jbellis Jonathan Ellis added a comment -

          Some questions:

          • Were we relying on the Set behavior to de-duplicate entries in replicas before copying it into an ArrayList at the end, or was that just a case of being over-cautious?
          • Why don't we need to check ranges.size > 0 any more in forceRepairAsync?
          • Do we need to fix other uses of tokenMetadata.getPrimaryRangesFor such as SS.sampleKeyRange?
          • Can we use getCachedEndpoints instead of calculateNaturalEndpoints?

          Also:

          • It's probably worth adding some comments to getPrimaryRangesForEndpoint – superficially, it looks like it is incorrect since it is still using the non-Strategy-aware metadata.getPredecessor, but after working some examples I am satisfied that it does the right thing, as it does here.
          yukim Yuki Morishita added a comment -

          Were we relying on the Set behavior to de-duplicate entries in replicas before copying it into an ArrayList at the end, or was that just a case of being over-cautious?

          hmm, I think we need to check if we have duplicates.

          Why don't we need to check ranges.size > 0 any more in forceRepairAsync?

          I added 'isEmpty' check at the beginning instead. Without that, repair command hangs on client side.

          Do we need to fix other uses of tokenMetadata.getPrimaryRangesFor such as SS.sampleKeyRange?

          I was not sure if we need to fix. It looks like sampleKeyRange is only used by nodetool.

          Can we use getCachedEndpoints instead of calculateNaturalEndpoints?

          Probably we can use getNaturalEndpoints, which uses cached endpoints.

          I'll brush up my patch with comments and unit tests.

          jbellis Jonathan Ellis added a comment -

          It looks like sampleKeyRange is only used by nodetool

          It's a minor problem (looks like it's mostly there to support OPP: CASSANDRA-2917) but we should probably fix it.

          Also, it looks like Bootstrap is using it to determine where to bisect ranges. We should fix that one way or another (where "another" might be "get rid of token selection on bootstrap and force people to either use vnodes or specify token manually"). Separate ticket as followup is fine here IMO.

          jbellis Jonathan Ellis added a comment -

          get rid of token selection on bootstrap and force people to either use vnodes or specify token manually

          To clarify: this would be best done in 2.0.

           yukim Yuki Morishita added a comment - edited

          v3 attached.

           • NTS now uses LinkedHashSet in calculateNaturalEndpoints to preserve insertion order while eliminating duplicates (see the sketch after this list).
           • I think it is unsafe to use cached endpoints through getNaturalEndpoints, since tokenMetadata cannot be kept consistent inside getPrimaryRangesForEndpoint, so I stuck with the implementation from v2.
           • Fixed sampleKeyRange. I think the problem is that the name tokenMetadata.getPrimaryRangeFor is confusing; we should probably rename it to just getRangeFor.
          • Added test for getPrimaryRangesForEndpoint to StorageServiceServerTest.
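           For reference, the behavior the LinkedHashSet change relies on (a generic Java illustration, not part of the patch):

           import java.util.LinkedHashSet;
           import java.util.Set;

           public class LinkedHashSetOrder
           {
               public static void main(String[] args)
               {
                   // LinkedHashSet drops duplicates but keeps first-insertion order, so the
                   // endpoint added first by the strategy can still be read back as "primary".
                   Set<String> endpoints = new LinkedHashSet<String>();
                   endpoints.add("10.2.29.38");      // added first -> stays first
                   endpoints.add("10.46.113.236");
                   endpoints.add("10.2.29.38");      // duplicate from a later pass -> ignored
                   System.out.println(endpoints);    // prints [10.2.29.38, 10.46.113.236]
               }
           }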
          jbellis Jonathan Ellis added a comment -

          I think this is fine the way it was:

          -        if (ranges.size() > 0)
          -        {
          -            new Thread(createRepairTask(cmd, keyspace, ranges, isSequential, isLocal, columnFamilies)).start();
          -        }
          +        new Thread(createRepairTask(cmd, keyspace, ranges, isSequential, isLocal, columnFamilies)).start();
          

          Otherwise LGTM. Created CASSANDRA-5499 for followup.

          yukim Yuki Morishita added a comment -

          Committed with above fix. Thanks!

          rcoli Robert Coli added a comment -

          get rid of token selection on bootstrap and force people to either use vnodes or specify token manually

           This has seemed operationally sane to me since approximately the 0.6 series. We gain almost nothing (will noobs really be discouraged by having to set a token manually?) and expose ourselves to unnecessary complexity and edge cases like this. +1

          jbellis Jonathan Ellis added a comment -

          Done in CASSANDRA-5518.

           alprema Kévin LOVATO added a comment - edited

           We just applied 1.2.5 on our cluster and the repair hanging is fixed, but -pr is still not working as expected.
           Our cluster has two datacenters, let's call them dc1 and dc2. We created a keyspace Test_Replication with replication factor { dc1: 3 } (no entry for dc2) and ran a nodetool repair Test_Replication (which used to hang) on dc2; it exited saying there was nothing to do (which is OK).
           Then we changed the replication factor to { dc1: 3, dc2: 3 } and started a nodetool repair -pr Test_Replication on cassandra11@dc2, which output this:

          user@cassandra11:~$ nodetool repair -pr Test_Replication
          [2013-06-03 13:54:53,948] Starting repair command #1, repairing 1 ranges for keyspace Test_Replication
          [2013-06-03 13:54:53,985] Repair session 676c00f0-cc44-11e2-bfd5-3d9212e452cc for range (0,1] finished
          [2013-06-03 13:54:53,985] Repair command #1 finished
          

          But even after flushing the Keyspace, there was no data on the server.
          We then ran a full repair:

          user@cassandra11:~$ nodetool repair  Test_Replication
          [2013-06-03 14:01:56,679] Starting repair command #2, repairing 6 ranges for keyspace Test_Replication
          [2013-06-03 14:01:57,260] Repair session 63632d70-cc45-11e2-bfd5-3d9212e452cc for range (0,1] finished
          [2013-06-03 14:01:57,260] Repair session 63650230-cc45-11e2-bfd5-3d9212e452cc for range (56713727820156410577229101238628035243,113427455640312821154458202477256070484] finished
          [2013-06-03 14:01:57,260] Repair session 6385d0a0-cc45-11e2-bfd5-3d9212e452cc for range (1,56713727820156410577229101238628035242] finished
          [2013-06-03 14:01:57,260] Repair session 639f7320-cc45-11e2-bfd5-3d9212e452cc for range (56713727820156410577229101238628035242,56713727820156410577229101238628035243] finished
          [2013-06-03 14:01:57,260] Repair session 63af51a0-cc45-11e2-bfd5-3d9212e452cc for range (113427455640312821154458202477256070484,113427455640312821154458202477256070485] finished
          [2013-06-03 14:01:57,295] Repair session 63b12660-cc45-11e2-bfd5-3d9212e452cc for range (113427455640312821154458202477256070485,0] finished
          [2013-06-03 14:01:57,295] Repair command #2 finished
          

          After which we could find the data on dc2 as expected.

           So it seems that -pr is still not working as expected, or maybe we're doing or understanding something wrong.
           (I was not sure whether I should open a new ticket or comment on this one, so please let me know if I should move it.)

          jbellis Jonathan Ellis added a comment -

          What should happen is that if you repair -pr on each node in dc2, then you will repair the full token space. But for a single node, YMMV. In particular, it's quite possible that this is correct:

          Repair session 676c00f0-cc44-11e2-bfd5-3d9212e452cc for range (0,1] finished

          Note the tiny range involved. (This indicates that your dc2 tokens are not balanced, btw.)

          jbellis Jonathan Ellis added a comment -

          This indicates that your dc2 tokens are not balanced, btw

          Hmm. Actually I don't see how repair could generate only a single range in a 2-DC setup and NTS. Can you post your ring?

          jjordan Jeremiah Jordan added a comment -

          With the following replication:

          { dc1: 3, dc2: 3 }
          

          And the following ring:

          node dc  token
          n0   dc1 0
          n1   dc2 1
          

           That is the expected output from "nodetool -h n1 repair -pr". Do a "nodetool -h n0 repair -pr" and n1 will get a bunch of data. -pr only repairs the range from the previous token up to the node's own token; if you don't have any data with a token of "1", then repair -pr won't do much for repairing n1.

           jbellis Jonathan Ellis added a comment - edited

           I should have said: a 2-DC setup, NTS, replicas in both DCs, and more than one node in each DC.

          In any case, I do see the problem now. Working on a fix.

          jjordan Jeremiah Jordan added a comment -

           If there is a problem, glad you found it, but I don't see how having multiple nodes changes the fact that the primary range of n1 is only (0,1] if both DCs have replicas.

           alprema Kévin LOVATO added a comment - edited

           [EDIT] I didn't see your latest posts before posting, but I hope the extra data can help anyway.

           You were right to say that I need to run repair -pr on all three nodes: I only have one row (it's a test) in the CF, so I guess I had to run repair -pr on the node in charge of that key.
           But I restarted my test and ran the repair on all three nodes, and it didn't work either; here's the output:

          user@cassandra11:~$ nodetool repair -pr Test_Replication
          [2013-06-03 13:54:53,948] Starting repair command #1, repairing 1 ranges for keyspace Test_Replication
          [2013-06-03 13:54:53,985] Repair session 676c00f0-cc44-11e2-bfd5-3d9212e452cc for range (0,1] finished
          [2013-06-03 13:54:53,985] Repair command #1 finished
          
          user@cassandra12:~$ nodetool repair -pr Test_Replication
          [2013-06-03 17:33:17,844] Starting repair command #1, repairing 1 ranges for keyspace Test_Replication
          [2013-06-03 17:33:17,866] Repair session e9f38c50-cc62-11e2-af47-db8ca926a9c5 for range (56713727820156410577229101238628035242,56713727820156410577229101238628035243] finished
          [2013-06-03 17:33:17,866] Repair command #1 finished
          
          user@cassandra13:~$ nodetool repair -pr Test_Replication
          [2013-06-03 17:33:29,689] Starting repair command #1, repairing 1 ranges for keyspace Test_Replication
          [2013-06-03 17:33:29,712] Repair session f102f3a0-cc62-11e2-ae98-39da3e693be3 for range (113427455640312821154458202477256070484,113427455640312821154458202477256070485] finished
          [2013-06-03 17:33:29,712] Repair command #1 finished
          

           The data is still not copied to the new datacenter, and I don't understand why the repair is run for those ranges (a range of size 1??). It could be an unbalanced-cluster problem, as you suggested, but we distributed the tokens as advised (+1 on the nodes of the new datacenter), as you can see in the following nodetool status:

          user@cassandra13:~$ nodetool status
          Datacenter: dc1
          =====================
          Status=Up/Down
          |/ State=Normal/Leaving/Joining/Moving
          --  Address         Load       Owns   Host ID                               Token                                    Rac
          UN  cassandra01     102 GB     33.3%  fa7672f5-77f0-4b41-b9d1-13bf63c39122  0                                        RC1
          UN  cassandra02     88.73 GB   33.3%  c799df22-0873-4a99-a901-5ef5b00b7b1e  56713727820156410577229101238628035242   RC1
          UN  cassandra03     50.86 GB   33.3%  5b9c6bc4-7ec7-417d-b92d-c5daa787201b  113427455640312821154458202477256070484  RC1
          Datacenter: dc2
          ======================
          Status=Up/Down
          |/ State=Normal/Leaving/Joining/Moving
          --  Address         Load       Owns   Host ID                               Token                                    Rac
          UN  cassandra11     51.21 GB   0.0%   7b610455-3fd2-48a3-9315-895a4609be42  1                                        RC2
          UN  cassandra12     45.02 GB   0.0%   8553f2c0-851c-4af2-93ee-2854c96de45a  56713727820156410577229101238628035243   RC2
          UN  cassandra13     36.8 GB    0.0%   7f537660-9128-4c13-872a-6e026104f30e  113427455640312821154458202477256070485  RC2
          

          Furthermore the full repair works, as you can see in this log:

          user@cassandra11:~$ nodetool repair  Test_Replication
          [2013-06-03 17:44:07,570] Starting repair command #5, repairing 6 ranges for keyspace Test_Replication
          [2013-06-03 17:44:07,903] Repair session 6d37b720-cc64-11e2-bfd5-3d9212e452cc for range (0,1] finished
          [2013-06-03 17:44:07,903] Repair session 6d3a0110-cc64-11e2-bfd5-3d9212e452cc for range (56713727820156410577229101238628035243,113427455640312821154458202477256070484] finished
          [2013-06-03 17:44:07,903] Repair session 6d4d6200-cc64-11e2-bfd5-3d9212e452cc for range (1,56713727820156410577229101238628035242] finished
          [2013-06-03 17:44:07,903] Repair session 6d581060-cc64-11e2-bfd5-3d9212e452cc for range (56713727820156410577229101238628035242,56713727820156410577229101238628035243] finished
          [2013-06-03 17:44:07,903] Repair session 6d5ea010-cc64-11e2-bfd5-3d9212e452cc for range (113427455640312821154458202477256070484,113427455640312821154458202477256070485] finished
          [2013-06-03 17:44:07,934] Repair session 6d604dc0-cc64-11e2-bfd5-3d9212e452cc for range (113427455640312821154458202477256070485,0] finished
          [2013-06-03 17:44:07,934] Repair command #5 finished
          

           I hope this information helps; please let me know if you think it's a configuration issue, in which case I will take it to the mailing list.

          jjordan Jeremiah Jordan added a comment -

           Kévin LOVATO, you need to run it on all 6 nodes. repair -pr only repairs the primary range; whenever you use repair -pr, you must run repair on every node that owns data for the keyspace you are repairing. If the KS is only in DC1, that is 3 nodes; if it is in DC1 and DC2, that is 6 nodes.

          jbellis Jonathan Ellis added a comment -

          I was right the first time; this is correct behavior. Quoting from CASSANDRA-5608:

          The right way to use -pr is still to repair everywhere the data exists; if we made -pr affect everything in the DC regardless of other replicas, then repairing the full cluster would repair each range 1x for each DC, which is not what we want

           alprema Kévin LOVATO added a comment - edited

           I redid the same test (creating the keyspace with data, then changing its replication factor so it's replicated in DC2, then repairing), and it turns out that if you don't run a repair on DC2 before changing the replication factor, repair -pr works fine -_-.

           Anyway, your solution worked; thank you for your help, and sorry for polluting JIRA with my questions.


            People

            • Assignee:
              yukim Yuki Morishita
              Reporter:
              jjordan Jeremiah Jordan
              Reviewer:
              Jonathan Ellis
            • Votes:
              0
              Watchers:
              4
