Cassandra
CASSANDRA-5171

Save EC2Snitch topology information in system table

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 2.0 beta 2
    • Component/s: Core
    • Labels: None
    • Environment: EC2

    Description

    EC2Snitch currently waits for gossip information to learn the cluster topology every time we restart. It would be nice to use the already available system table info, similar to GPFS (GossipingPropertyFileSnitch).
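
    A minimal sketch of the saving side of the idea, assuming a SystemKeyspace.updatePeerInfo()-style helper (named here for illustration only): persist each peer's DC/rack as gossip reports it, so the snitch can read it back from the system table on the next restart instead of waiting for gossip.

        // Sketch: persist topology learned from gossip so it survives a restart.
        // SystemKeyspace.updatePeerInfo(...) is an assumed helper for illustration.
        public void onChange(InetAddress endpoint, ApplicationState state, VersionedValue value)
        {
            if (state == ApplicationState.DC)
                SystemKeyspace.updatePeerInfo(endpoint, "data_center", value.value);
            else if (state == ApplicationState.RACK)
                SystemKeyspace.updatePeerInfo(endpoint, "rack", value.value);
        }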

        Activity

        Brandon Williams added a comment -

        +1

        Vijay added a comment -

        Committed to 1.2 and trunk. Thanks!

        Jonathan Ellis added a comment -

        This was reverted in CASSANDRA-5432, but I think the problem it solves is actually pretty severe, so I'm reopening it.

        The problem is that pretty much everything from TokenMetadata to NetworkTopologyStrategy assumes that once we see a node, the snitch can tell us where it lives, and in particular that once the snitch tells us where a node lives it won't change its answer.

        So this is problematic:

            public String getDatacenter(InetAddress endpoint)
            {
                if (endpoint.equals(FBUtilities.getBroadcastAddress()))
                    return ec2region;
                EndpointState state = Gossiper.instance.getEndpointStateForEndpoint(endpoint);
                if (state == null || state.getApplicationState(ApplicationState.DC) == null)
                    return DEFAULT_DC;
                return state.getApplicationState(ApplicationState.DC).value;
            }
        

        That is, if we don't know where a node belongs (e.g., we just restarted and haven't been gossiped to yet), we assume it's in DEFAULT_DC.

        This can lead to data loss. Consider node X in DC1, where keyspace KS is replicated. Suddenly X is yanked out of DC1 and placed in DC2, where KS is not replicated. Nobody will bother querying X for the data in KS that was formerly replicated to it. Even repair will not see it.
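
        For reference, a minimal sketch of the direction this ticket takes: consult the topology persisted in the system table before giving up and returning DEFAULT_DC. The SystemKeyspace.loadDcRackInfo() helper and the "data_center" key are assumed here for illustration, not quoted from the patch.

            // Sketch only: fall back to the DC saved in the system table before DEFAULT_DC.
            public String getDatacenter(InetAddress endpoint)
            {
                if (endpoint.equals(FBUtilities.getBroadcastAddress()))
                    return ec2region;
                EndpointState state = Gossiper.instance.getEndpointStateForEndpoint(endpoint);
                if (state != null && state.getApplicationState(ApplicationState.DC) != null)
                    return state.getApplicationState(ApplicationState.DC).value;
                // Assumed helper: last-known DC/rack per peer, persisted across restarts.
                Map<String, String> saved = SystemKeyspace.loadDcRackInfo().get(endpoint);
                if (saved != null && saved.get("data_center") != null)
                    return saved.get("data_center");
                return DEFAULT_DC;
            }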

        Vijay added a comment -

        The attached patch brings the reverted patch back to life; in addition, it saves the reseted_ip in the system table, so when a connection is reopened to the host we use the reseted_ip instead of the endpoint address.

        Vijay added a comment -

        Looks like CASSANDRA-5669 forces the user to open both public and private IPs for communication within an AZ (and hence CASSANDRA-5432 is no longer a problem), so v2 is not needed and we can commit v1.

        Jason Brown added a comment - - edited

        While this patch (v1 actually) was reverted in CASSANDRA-5432, it was never satisfactorily explained why the patch failed to work as expected. I'm adding details here so we can get this ticket done right.

        First, it's helpful to explore how a node starts gossip in EC2MRS with inter-DC (inter-region) communication enabled (and a Priam-type setup).

        1. An EC2 instance is started; Priam comes up first and adds publicIP/sslPort to the security group's ingress privileges (so this node can accept connections on its publicIP/sslPort from anywhere).
        2. C* starts and gets the seed nodes' public hostnames from Priam.
        3. We gossip to one of the seeds - the public hostname will resolve to that node's public IP addr.
        4. When OTC goes to write the first message to the seed, it gets a socket from OTCP.newSocket(). newSocket() calls isEncryptedChannel() to determine whether we need to encrypt the data on the wire. As we don't know anything yet about the seed node (remember, we haven't started gossip with anyone yet), isEncryptedChannel() will always return true when the following are true (see the sketch after this list):
          1. internode_encryption != none
          2. we don't know the DC or RACK info for the remote node (which is the case when using EC2MRS). This step is a little funky, as OTCP asks the snitch for the seed's DC/RACK, to which EC2MRS returns UNKNOWN-DC/UNKNOWN-RACK, which will just happen not to match a value like "us-east-1" (the current node's DC).
        5. Create the socket using the remote node's publicIP addr on the SSL port.
        6. Create the connection and send messages successfully, assuming you've opened the SSL port for public addresses on the security group (which Priam handles).
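
        For reference, here is a rough paraphrase of the isEncryptedChannel() check described in step 4; it is reconstructed from the behavior above rather than quoted from the source, so treat the details as approximate.

            // Reconstructed sketch of OTCP.isEncryptedChannel(), not verbatim source.
            public static boolean isEncryptedChannel(InetAddress address)
            {
                IEndpointSnitch snitch = DatabaseDescriptor.getEndpointSnitch();
                switch (DatabaseDescriptor.getServerEncryptionOptions().internode_encryption)
                {
                    case none:
                        return false;
                    case all:
                        break;
                    case dc:
                        // EC2MRS returns UNKNOWN-DC for a node we haven't gossiped with,
                        // which never matches the local DC, so we fall through to "encrypt".
                        if (snitch.getDatacenter(address).equals(snitch.getDatacenter(FBUtilities.getBroadcastAddress())))
                            return false;
                        break;
                    case rack:
                        if (snitch.getRack(address).equals(snitch.getRack(FBUtilities.getBroadcastAddress()))
                            && snitch.getDatacenter(address).equals(snitch.getDatacenter(FBUtilities.getBroadcastAddress())))
                            return false;
                        break;
                }
                return true;
            }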

        Thus, if we are connecting to a node in the same EC2 region, we connect on the publicIP (as expected) but use the SSL port.

        After we learn, via gossip, about a remote node's DC/RACK/localIP, we can choose to reconnect to nodes in the same region on the localIP/nonSSLPort.
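
        A sketch of how that reconnect can happen once gossip supplies the INTERNAL_IP; the method shape and names here are approximate (a reconnect helper driven by the multi-region snitch), not the exact source.

            // Approximate sketch: once we know a same-DC peer's private address via gossip,
            // point the outbound connection pool at it instead of the public address.
            private void reconnect(InetAddress publicAddress, InetAddress localAddress)
            {
                IEndpointSnitch snitch = DatabaseDescriptor.getEndpointSnitch();
                String localDc = snitch.getDatacenter(FBUtilities.getBroadcastAddress());
                if (snitch.getDatacenter(publicAddress).equals(localDc)
                    && !MessagingService.instance().getConnectionPool(publicAddress).endPoint().equals(localAddress))
                {
                    // Switch the pool over to the private IP / non-SSL port for intra-region traffic.
                    MessagingService.instance().getConnectionPool(publicAddress).reset(localAddress);
                }
            }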

        The reason Vijay's patch had problems here is that on restart we already know the DC/RACK from the previous execution of C* on this node, so the check in OTCP.isEncryptedChannel() returns false (do not use encryption) and we choose the non-SSL port when creating a connection to the publicIP. The connection creation then ultimately fails because the non-SSL port is not open for traffic to the public IP on the security group (nor should it be). EDIT: The other part of the problem is that we start the connection on the publicIP rather than the localIP (INTERNAL_IP), even when we already have the localIP.

        To make this patch work, then, I think getting the localIP address in OTCP's ctor would work best. The code would look something like this:

            OutboundTcpConnectionPool(InetAddress remoteEp)
            {
                EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(remoteEp);
                IEndpointSnitch snitch = DatabaseDescriptor.getEndpointSnitch();
                // If gossip already knows the node's private address and the node lives in our DC,
                // connect on the INTERNAL_IP instead of the public endpoint address.
                if (epState != null
                    && epState.getApplicationState(ApplicationState.INTERNAL_IP) != null
                    && epState.getApplicationState(ApplicationState.DC) != null
                    && epState.getApplicationState(ApplicationState.DC).value.equals(snitch.getDatacenter(FBUtilities.getBroadcastAddress())))
                {
                    try
                    {
                        id = InetAddress.getByName(epState.getApplicationState(ApplicationState.INTERNAL_IP).value);
                    }
                    catch (UnknownHostException e)
                    {
                        id = remoteEp;
                    }
                }
                else
                {
                    id = remoteEp;
                }

                cmdCon = new OutboundTcpConnection(this);
                cmdCon.start();
                ackCon = new OutboundTcpConnection(this);
                ackCon.start();

                metrics = new ConnectionMetrics(id, this);
            }
        

        Then you would connect on the localIP addr with the correct port (SSL or non-SSL).

        Jason Brown added a comment -

        Ahh, I didn't look at Vijay's second patch, but it more or less does what I suggested. However, I thought we were already keeping the resettedAddr in the system table; the second patch adds that in.

        Jason Brown added a comment -

        I'm +1 on the v2 patch. As a minor nit, you might rename the "communication_ip" to something like "preferred_ip".

        As an aside, I've never been thrilled with the default "UNKNOWN-DC" that can get returned when we don't know the DC. It's guaranteed to be wrong in almost all cases. However, I'm not sure what we'd do instead - returning null or empty string seems only slightly worse than the current default.

        Vijay added a comment -

        I've never been thrilled with the default "UNKNOWN-DC"

        I am not a fan either; if we throw an exception, the node will never start gossiping....
        Back in the day, it made sense with the old way of versioning, etc. We should probably open another ticket to discuss this if needed.

        I will commit with the nit addressed in a few minutes. Thanks!

        Vijay added a comment -

        PS: I only committed to 2.0 to be safe; let me know if you think otherwise. Thanks!

        Jason Brown added a comment - - edited

        I thought the patch was reasonable enough for 1.2. If you give me a few days, I can test it out in our env and let you know whether it borks everything or not.

        EDIT: Yeah, we'll definitely want to test to make sure it's cool with CASSANDRA-5669 (i.e., that they don't collide and prevent connecting at all).

        Vijay added a comment -

        Thanks Jason!

        Jason Brown added a comment -

        damn jira hotkeys - sorry this got reassigned (to me, and back), Vijay


          People

          • Assignee: Vijay
          • Reporter: Vijay
          • Reviewer: Brandon Williams
          • Votes: 1
          • Watchers: 3
