CASSANDRA-5432

Repair Freeze/Gossip Invisibility Issues 1.2.4

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.2.5
    • Component/s: None
    • Labels:
      None
    • Environment:

      Ubuntu 10.04.1 LTS
      C* 1.2.3
      Sun Java 6 u43
      JNA Enabled
      Not using VNodes

      Description

      See comment 6: this description summarizes only the repair issue, but I believe there is a bigger networking problem going on, as described in that comment.

      Since upgrading our sandbox cluster, I have been unable to run repair on any node, and we hit our gc_grace_seconds this weekend. Please help. So far, I have tried the following suggestions:

      • nodetool scrub
      • offline scrub
      • running repair on each CF separately (didn't matter; all got stuck the same way)

      The repair command just gets stuck while the machine idles. Only the following log lines are printed for the repair job:

      INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379) Starting repair command #4, repairing 1 ranges for keyspace cardspring_production
      INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java (line 652) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 new session: will sync /X.X.X.190, /X.X.X.43, /X.X.X.56 on range (1808575600,42535295865117307932921825930779602032] for keyspace_production.[comma separated list of CFs]
      INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java (line 858) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 requesting merkle trees for BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56, /X.X.X.190])
      INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java (line 214) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 Received merkle tree for ColumnFamilyName from /X.X.X.43
      INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java (line 214) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 Received merkle tree for ColumnFamilyName from /X.X.X.56

      Please advise.


          Activity

          slebresne Sylvain Lebresne added a comment -

          Took the liberty to commit as I want to re-roll 1.2.5.

          brandon.williams Brandon Williams added a comment -

          I never thought CASSANDRA-5171 was a really big gain anyway, but it looked innocuous enough at the time. +1 on reverting it.

          vijay2win@yahoo.com Vijay added a comment -

          The problem is that the private IP we need for communication within a DC/region is not available until we have gossiped with the other nodes.
          Since we don't have the private address, but we do have the rest (DC/rack), we try to connect via the public IP.

          Removing that optimization forces us to assume the node is in another DC, and hence to use the public IP and SSL port; eventually, when we receive the private IP, we reset the connection to use the right (private IP) address.
          You may ask why we don't just store the private IP. We could, but currently the reset-connection (to private IP) logic lives in the snitch.
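
          Concretely, that reset amounts to something like the following sketch (the wrapper class is hypothetical; MessagingService.getConnectionPool()/reset() are the calls such logic would go through):

              import java.net.InetAddress;
              import org.apache.cassandra.net.MessagingService;

              // Hypothetical sketch of the snitch's reconnect-to-private-IP step.
              public final class PrivateIpReconnector
              {
                  private PrivateIpReconnector() {}

                  // Once gossip delivers a DC-local peer's private address, point the
                  // connection pool keyed by its public address at the private one,
                  // so subsequent messages stay inside the region.
                  public static void reconnect(InetAddress publicAddress, InetAddress privateAddress)
                  {
                      MessagingService.instance().getConnectionPool(publicAddress).reset(privateAddress);
                  }
              }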

          jbellis Jonathan Ellis added a comment -

          Why does "let's use the last-known location of this node" cause problems?

          arya Arya Goudarzi added a comment - - edited

          +1 works for me. Thank you.

          arya Arya Goudarzi added a comment -

          Sure, I should be able to get back to you either tonight or tomorrow.

          vijay2win@yahoo.com Vijay added a comment -

          The attached patch reverts CASSANDRA-5171 and adds handling of the local endpoint.

          Arya, do you mind testing this?
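
          For illustration, the "handling of local endpoint" could look something like this minimal sketch (hypothetical helper class; FBUtilities.getBroadcastAddress() and FBUtilities.getLocalAddress() are the 1.2-era utilities), so that a node never dials its own public IP:

              import java.net.InetAddress;
              import org.apache.cassandra.utils.FBUtilities;

              // Hypothetical sketch: substitute the locally bound address whenever
              // the target is this node's own broadcast (public) address, so the
              // connection never has to leave the machine.
              public final class LocalEndpointResolver
              {
                  private LocalEndpointResolver() {}

                  public static InetAddress resolve(InetAddress endpoint)
                  {
                      return endpoint.equals(FBUtilities.getBroadcastAddress())
                             ? FBUtilities.getLocalAddress()
                             : endpoint;
                  }
              }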

          ondrej.cernos Ondřej Černoš added a comment -

          Please see also CASSANDRA-5493: the MessagingService also reports dropped messages on itself using its public IP. The output displays 3 public IPs and 2 private ones (the private IP of the node itself is not included), while the remote DC is reported correctly. This seems related.

          ondrej.cernos Ondřej Černoš added a comment -

          I have exactly the same issue as Arya.

          I also had to open non-SSL ports from within the datacenter in order to create the cluster.

          I was wondering if it could be a networking issue (we use a mixed AWS/private-cloud setup), so it is good to see we are not alone in this.

          arya Arya Goudarzi added a comment - - edited

          So, I rolled back CASSANDRA-5171 and pushed it to my test cluster. The gossip issue where nodes didn't see each other after a restart is fixed. Repair, however, still tries to connect to the machine running the repair (itself) via its public IP when requesting the Merkle tree, and it gets stuck there, so that issue remains. Some behavior did change, though: OutboundTcpConnection no longer reported connecting to the other 2 replicas for the Merkle tree requests, so I only saw the "attempting to connect" message for the node itself. Here is the snippet:

          INFO [Thread-458] 2013-04-24 23:21:16,543 StorageService.java (line 2407) Starting repair command #1, repairing 1 ranges for keyspace app_production
          DEBUG [Thread-458] 2013-04-24 23:21:16,580 StorageService.java (line 2547) computing ranges for 1808575600, 7089215977519551322153637656637080005, 14178431955039102644307275311465584410, 42535295865117307932921825930779602030, 49624511842636859255075463585608106435, 56713727820156410577229101240436610840, 85070591730234615865843651859750628460, 92159807707754167187997289514579132865, 99249023685273718510150927169407637270, 127605887595351923798765477788721654890, 134695103572871475120919115443550159295, 141784319550391026443072753098378663700
          INFO [AntiEntropySessions:1] 2013-04-24 23:21:16,587 AntiEntropyService.java (line 651) repair #a9a87e40-ad35-11e2-945a-050d956ff11b new session: will sync /YYY.XX.98.11, /YY.XXX.107.137, /YY.XXX.133.163 on range (99249023685273718510150927169407637270,127605887595351923798765477788721654890] for cardspring_production.[App]
          INFO [AntiEntropySessions:1] 2013-04-24 23:21:16,598 AntiEntropyService.java (line 857) repair #a9a87e40-ad35-11e2-945a-050d956ff11b requesting merkle trees for App (to [/XX.YYY.107.137, /XX.YYY.133.163, /XXX.YY.98.11])
          DEBUG [WRITE-/107.20.98.11] 2013-04-24 23:21:16,601 OutboundTcpConnection.java (line 260) attempting to connect to /XXX.YY.98.11
          INFO [AntiEntropyStage:1] 2013-04-24 23:21:19,111 AntiEntropyService.java (line 213) repair #a9a87e40-ad35-11e2-945a-050d956ff11b Received merkle tree for App from /XX.YYY.133.163
          DEBUG [ScheduledTasks:1] 2013-04-24 23:21:19,409 GCInspector.java (line 121) GC for ParNew: 54 ms for 1 collections, 669806384 used; max is 4211081216
          INFO [AntiEntropyStage:1] 2013-04-24 23:21:20,408 AntiEntropyService.java (line 213) repair #a9a87e40-ad35-11e2-945a-050d956ff11b Received merkle tree for App from /XX.YYY.107.137

          See the debug line with OutboundTcpConnection: it is trying to connect to the node's own public IP (XXX.YY.98.11), which is still an issue. Before this line I expected two more consecutive lines, as before, showing OutboundTcpConnection attempting to connect to the other nodes as well. Those log lines did not appear, yet the other nodes returned their Merkle trees, so connections to them were somehow made successfully.

          arya Arya Goudarzi added a comment -

          I was actually suspicious about that. I can roll back that patch and try it. Give me until the end of the week; my hands are tied up right now.

          vijay2win@yahoo.com Vijay added a comment -

          Priam opens the port for DCs to talk to each other, but does nothing within a DC. I still suspect the security group setup, because all IPs within a security group should have both ports open.
          Maybe CASSANDRA-5171 created a side effect; I am not sure.

          Jason Brown, do you mind verifying this with 1.2.4? Verifying it with Priam is a bigger undertaking for me right now.

          arya Arya Goudarzi added a comment -

          Priam only opens one port, and that is the SSL port on the public IPs (see line 74): http://goo.gl/vY8WX

          I did not remove the IPs from the security group. I left the IP rules for the SSL port as set by Priam; I only removed the non-SSL port rules on public IPs, which I had added manually to work around this issue.

          vijay2win@yahoo.com Vijay added a comment - - edited

          Hi Arya, thanks. You can call me anytime, but it will help others if we keep the discussion here.

          > Has this always been the case?

          As far as I know, yes.

          > I go to security groups and remove the non SSL on public IP rules that I added in previous step.

          I think you should not remove those rules. Priam opens up ports for the local nodes and also the remote nodes within the security group (http://goo.gl/l9Q1T). You shouldn't do the above, because you are now preventing Cassandra from re-establishing the connections.

          Also, the reason you see all the nodes as UP in a multi-region setup even though they cannot communicate within the DC is the issue mentioned in CASSANDRA-3533. I can almost bet that the read/write requests are failing in the local DC; if not, try after restarting the nodes.

          Let me know if you still have issues or disagree.

          arya Arya Goudarzi added a comment - - edited

          Hey Vijay,

          Good to see you here. Sorry if my analysis is unclear. Here is my take:

          > The first time we initiate communication to a node we use the public IP; eventually, once we have the private IP, we switch back to the local address.

          Has this always been the case? If you are using public IPs (not public DNS names), there have to be explicit security rules on the public IPs to allow this. Otherwise, if your security groups open ports to the machines in the same group by security group name, traffic is allowed only on their private IPs, so this won't work.

          We use Priam (your awesome tooling), and as you know, it opens up only the SSL port on the public IPs, for cross-region communication. From the operator's perspective, that is the correct thing to do. I only have the SSL port open on the public IPs and don't want to open the non-SSL port, for security reasons. All the other ports (non-SSL, JMX, etc.) are opened the way I described, by security group name, which allows traffic only on the private IPs. That is just the way AWS works. So, within the same region, trying to connect to any machine by its public IP won't work.

          Here is how I produced the scenario above; I believe it all ties back to your statement that machines connect to public IPs first.

          Set up a cluster as I described in my previous comment. It can be a single region. Restart all machines at the same time. Each machine now sees only itself as UP; everyone else is reported DOWN in nodetool ring. I am guessing this is because the nodes try to gossip to the public IPs, but only the SSL port is open there, and the cluster is configured to use SSL only across datacenters/regions, not within the same region. So now I am left with a bunch of nodes that only see themselves in the ring. I go to my AWS console and open the non-SSL port for every single public IP in that security group. Now all the nodes see each other.

          By now I had a theory that the nodes want to communicate through the public IP, which is not possible, so I moved on to troubleshooting repairs. I knew that with the current settings the repair would succeed, so, since the nodes saw each other again, I went to the security groups and removed the non-SSL-on-public-IP rules I had added in the previous step. I started the repair and ended up with the log message above. The public IP mentioned in the log belongs to the node that owns the log and is running the repair, so it tried to communicate with itself using its own public IP.

          Did I make sense? I can call you and describe it over the phone, but basically this setup used to work on 1.1.10 and does not work on 1.2.4. I have attached a debugger to a node and am trying to trace the code. I'll let you know if I find something new.

          vijay2win@yahoo.com Vijay added a comment -

          Arya,
          The first time we initiate communication to a node we use the public IP; eventually, once we have the private IP, we switch back to the local address.

          I am confused by the analysis, because the nodes should already be connected and communicating, and a tree request is just another message on the same channel as any other message.
          Are the nodes up in the first place?

          this.treeRequests = new RequestCoordinator<TreeRequest>(isSequential)
          {
              public void send(TreeRequest r)
              {
                  MessagingService.instance().sendOneWay(r.createMessage(), r.endpoint);
              }
          };

          arya Arya Goudarzi added a comment - - edited

          > non-ssl on the private IP within the same one [region]

          OK, a little more digging, and I found the root cause, which I believe is a bug, so I am re-opening this.

          See this log snippet from a repair session I triggered on a node in a single region in AWS:

          INFO [AntiEntropySessions:1] 2013-04-19 04:28:16,587 AntiEntropyService.java (line 651) repair #8e59b7c0-a8a9-11e2-ba85-d39d57f66b97 new session: will sync /54.242.X.YYY, /54.224.XX.YYY, /50.17.XXX.YYY on range (99249023685273718510150927169407637270,127605887595351923798765477788721654890] for cardspring_production.[App]
          INFO [AntiEntropySessions:1] 2013-04-19 04:28:16,591 AntiEntropyService.java (line 857) repair #8e59b7c0-a8a9-11e2-ba85-d39d57f66b97 requesting merkle trees for App (to [/54.224.XX.YYY, /50.17.XXX.YYY, /54.242.X.YYY])
          DEBUG [WRITE-/50.17.159.210] 2013-04-19 04:28:16,592 OutboundTcpConnection.java (line 260) attempting to connect to /10.170.XX.YYY
          DEBUG [WRITE-/54.224.36.214] 2013-04-19 04:28:16,593 OutboundTcpConnection.java (line 260) attempting to connect to /10.121.XX.YYY
          DEBUG [WRITE-/54.242.1.111] 2013-04-19 04:28:16,593 OutboundTcpConnection.java (line 260) attempting to connect to /54.242.X.YYY

          Notice the last line: it shows the public IP of the node running the repair. Why is the node picking its own public IP address to send the tree request to? This is the source of the problem. In AWS you cannot communicate through a public IP address when the security group rules are defined by group name within the same region, which is a common setup. Hence the tree request to itself gets stuck at the sending point.

          jbellis Jonathan Ellis added a comment -

          You said above that you had it configured this way in 1.1 as well:

          7100 from cluster1 (Configured Normal Storage)
          7103 from cluster1 (Configured SSL Storage)

          In any case, it is not a bug for you to need both open; Cassandra will use SSL between datacenters (regions), and non-SSL on the private IP within the same one.
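
          For context, this DC-only SSL behavior corresponds to the internode encryption setting in cassandra.yaml; a sketch of such a configuration (keystore paths and passwords are illustrative):

              # Encrypt internode traffic only between datacenters; nodes within
              # the same DC talk plaintext on the private IP.
              server_encryption_options:
                  internode_encryption: dc
                  keystore: conf/.keystore
                  keystore_password: cassandra
                  truststore: conf/.truststore
                  truststore_password: cassandra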

          arya Arya Goudarzi added a comment -

          I added a correction. It is not JMX, Jonathan; you are right. It is opening the non-SSL storage port on the public IPs that fixes it. We didn't have to do this on 1.1.10.

          arya Arya Goudarzi added a comment -

          I have already used the IRC channel; it was suggested there that I open a JIRA ticket, as no one could help.

          jbellis Jonathan Ellis added a comment -

          Gossip does not touch JMX. JMX is not used internally at all; it's only there to let nodetool invoke methods.

          Please see the user mailing list for troubleshooting help; Jira is not a good place for that.

          arya Arya Goudarzi added a comment - - edited

          I narrowed this down to the non-SSL storage port: it must be open on the public IPs. Here are the steps to reproduce.

          This is a working configuration:

          • Cassandra 1.1.10 cluster with 12 nodes in us-east-1 and 12 nodes in us-west-2
          • Ec2MultiRegionSnitch, SSL enabled for DC_ONLY, and NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;
          • C* instances are in a security group called 'cluster1'
          • Security group 'cluster1' in each region allows TCP:
            7199 from cluster1 (JMX)
            1024-65535 from cluster1 (JMX random ports)
            7100 from cluster1 (configured normal storage)
            7103 from cluster1 (configured SSL storage)
            9160 from cluster1 (configured Thrift RPC port)
            9160 from <client_group>
          • For each node's public IP we also have this rule set to enable cross-region communication:
            7103 from public_ip

          The above is a functioning and happy setup. You run repair, and it finishes successfully.

          Broken setup:

          Upgrade to 1.2.4 without changing any of the above security group settings.

          Run repair. The repair never receives the Merkle tree for itself and thus hangs; see the description. The test in the description was done with one region and a strategy of us-east-1:3, but the other settings were exactly the same.

          Now, for each public_ip, add a rule like this to the cluster1 security group:

          Allow TCP: 7100 from public_ip

          Run repair. Things will magically work now.

          If nothing about ports or networking changed in 1.2, why is the above happening? I can reproduce it consistently.

          This also affects gossip: if the non-SSL storage port is not open on the public IPs, then after a snap restart of all nodes at once, each node's gossip sees no node except itself.
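
          For reference, the per-public-IP workaround rule above could be added with today's AWS CLI along these lines (the EC2 API tools of that era used a different syntax; the IP is illustrative):

              # Allow the non-SSL storage port (7100) from one node's public IP.
              aws ec2 authorize-security-group-ingress \
                  --group-name cluster1 \
                  --protocol tcp --port 7100 \
                  --cidr 54.242.1.111/32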

          brandon.williams Brandon Williams added a comment -

          Nothing changed with port usage. There are standard ways to see what ports a process is using to check this, though.
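
          For example, on Linux the listening ports can be checked with something like (assuming root access; <cassandra_pid> is a placeholder):

              # List TCP sockets the Cassandra JVM is listening on.
              sudo netstat -ltnp | grep java
              # Or, with lsof, for a specific pid:
              sudo lsof -nP -iTCP -sTCP:LISTEN -a -p <cassandra_pid>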

          arya Arya Goudarzi added a comment - - edited

          OK, I found the problem, but something changed in this release regarding networking that is not clear to me. I use EC2. I had to open all TCP ports to the world for repairs to work; they didn't even work when I allowed all TCP within our C* security group. This is not acceptable, as it is a security risk. What changed in 1.2.3 in terms of repair routing? Shouldn't it just use the storage port?

          We use Ec2MultiRegionSnitch, so it returns DNS names that resolve to local IPs for in-region communication and to public IPs for cross-region communication. I have a C* 1.1.10 cluster in production, and it works fine without the security group being wide open.

          Please advise.

          arya Arya Goudarzi added a comment -

          I really need to get this going before Sunday. I also looked at all the other nodes' logs; nothing interesting there.

          arya Arya Goudarzi added a comment -

          Thanks Yuki. Node x.x.x.190 is the node I triggered the repair on (itself). The logs above belong to that node and stop right there. There is no error or exception.

          yukim Yuki Morishita added a comment -

          If that's the only log you get so far, then the node is waiting for a Merkle tree response from /x.x.x.190.
          Check whether you have any errors on that node.


            People

            • Assignee:
              vijay2win@yahoo.com Vijay
              Reporter:
              arya Arya Goudarzi
              Reviewer:
              Brandon Williams
            • Votes:
              1
              Watchers:
              6
