CASSANDRA-5102

upgrading from 1.1.7 to 1.2.0 caused upgraded nodes to only know about other 1.2.0 nodes

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Fix Version/s: 1.2.1
    • Component/s: None
    • Labels: None

      Description

      I upgraded as I have since 0.86 and things didn't go very smoothly.

      I ran nodetool drain on my 1.1.7 node and changed my puppet config to use the new merged config. When it came back up (without any errors in the log), nodetool ring showed only the node itself. I upgraded another node and, sure enough, nodetool ring now showed two nodes.

      I tried resetting the local schema. The upgraded node happily grabbed the schema again but still only 1.2 nodes were visible in the ring to any upgraded nodes.

      1. 5102.txt (2 kB, Brandon Williams)

        Activity

        Brandon Williams added a comment -

        Can you add nodetool gossipinfo from one of the new nodes that doesn't see the old ones? Do the old ones see the new nodes?

        Michael Kjellman added a comment -

        i'm using PropertyFileSnitch as my endpoint snitch btw

        Michael Kjellman added a comment -

        at this point i just went ahead and upgraded all the nodes (was more worried about getting the cluster back up)

        I do notice though that the 1.2.0 nodes show net_version of 6.

        as nodes were upgraded to 1.2.0 they didn't show up in the ring on the 1.1.7 side anymore.

        gossipinfo on 1.2.0 nodes (ubuntu 12.04) looks like:

        /10.8.30.14
        RELEASE_VERSION:1.2.0
        NET_VERSION:6
        RPC_ADDRESS:0.0.0.0
        HOST_ID:24647d52-41eb-4df3-993e-51d4f841ca62
        LOAD:2.0129361318E11
        STATUS:NORMAL,70892159775195513221536376548285044050
        DC:DC1
        SCHEMA:da921e0b-4154-3601-9c76-6f61ca5f2872
        RACK:RAC1
        SEVERITY:-3.991605743852711E-11
        /10.8.25.101
        RELEASE_VERSION:1.2.0
        RPC_ADDRESS:0.0.0.0
        NET_VERSION:6
        HOST_ID:dd3a40e2-fef1-4574-87b8-e2929fd80235
        LOAD:1.56018171896E11
        STATUS:NORMAL,42535295865117307932921825928971026436
        DC:DC1
        SCHEMA:da921e0b-4154-3601-9c76-6f61ca5f2872
        RACK:RAC2
        SEVERITY:0.019533560597218058

        Brandon Williams added a comment -

        Did they show up on the 1.1 side as being down, or not at all?

        Michael Kjellman added a comment -

        not at all.

        Michael Kjellman added a comment -

        very possibly unrelated but opscenter seems to be unable to identify the nodes in the cluster either:

        2013-01-02 18:39:24-0800 [] WARN: Unable to find a matching cluster for [u'fe80:0:0:0:ca60:ff:feea:9c03%2', u'10.8.25.114', u'0:0:0:0:0:0:0:1%1', u'127.0.0.1', u'127.0.1.1']

        maybe the node's identifiers changed with the ipv6 address which caused it to not be a member of the ring?

        Michael Kjellman added a comment -

        from one of the last nodes to be upgraded...

        ERROR [GossipStage:906] 2013-01-02 13:51:44,982 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[GossipStage:906,5,main]
        java.lang.RuntimeException: java.net.UnknownHostException: addr is of illegal length
                at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:89)
                at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
                at java.lang.Thread.run(Thread.java:722)
        Caused by: java.net.UnknownHostException: addr is of illegal length
                at java.net.InetAddress.getByAddress(InetAddress.java:979)
                at java.net.InetAddress.getByAddress(InetAddress.java:1374)
                at org.apache.cassandra.net.CompactEndpointSerializationHelper.deserialize(CompactEndpointSerializationHelper.java:39)
                at org.apache.cassandra.gms.EndpointStatesSerializationHelper.deserialize(GossipDigestSynMessage.java:117)
                at org.apache.cassandra.gms.GossipDigestAckMessageSerializer.deserialize(GossipDigestAckMessage.java:83)
                at org.apache.cassandra.gms.GossipDigestAckMessageSerializer.deserialize(GossipDigestAckMessage.java:70)
                at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:60)
                ... 4 more
        ERROR [GossipStage:907] 2013-01-02 13:51:45,984 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[GossipStage:907,5,main]
        java.lang.RuntimeException: java.net.UnknownHostException: addr is of illegal length
                at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:89)
                at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
                at java.lang.Thread.run(Thread.java:722)
        Caused by: java.net.UnknownHostException: addr is of illegal length
                at java.net.InetAddress.getByAddress(InetAddress.java:979)
                at java.net.InetAddress.getByAddress(InetAddress.java:1374)
                at org.apache.cassandra.net.CompactEndpointSerializationHelper.deserialize(CompactEndpointSerializationHelper.java:39)
                at org.apache.cassandra.gms.EndpointStatesSerializationHelper.deserialize(GossipDigestSynMessage.java:117)
                at org.apache.cassandra.gms.GossipDigestAckMessageSerializer.deserialize(GossipDigestAckMessage.java:83)
                at org.apache.cassandra.gms.GossipDigestAckMessageSerializer.deserialize(GossipDigestAckMessage.java:70)
                at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:60)
                ... 4 more
        ERROR [GossipStage:908] 2013-01-02 13:51:46,988 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[GossipStage:908,5,main]
        java.lang.RuntimeException: java.net.UnknownHostException: addr is of illegal length
                at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:89)
                at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
                at java.lang.Thread.run(Thread.java:722)
        Caused by: java.net.UnknownHostException: addr is of illegal length
                at java.net.InetAddress.getByAddress(InetAddress.java:979)
                at java.net.InetAddress.getByAddress(InetAddress.java:1374)
                at org.apache.cassandra.net.CompactEndpointSerializationHelper.deserialize(CompactEndpointSerializationHelper.java:39)
                at org.apache.cassandra.gms.EndpointStatesSerializationHelper.deserialize(GossipDigestSynMessage.java:117)
                at org.apache.cassandra.gms.GossipDigestAckMessageSerializer.deserialize(GossipDigestAckMessage.java:83)
                at org.apache.cassandra.gms.GossipDigestAckMessageSerializer.deserialize(GossipDigestAckMessage.java:70)
                at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:60)
                ... 4 more
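
For reference, the innermost exception in the traces above can be reproduced in isolation. The sketch below is not Cassandra code; the class and method names are made up for illustration. It only shows that java.net.InetAddress.getByAddress accepts 4-byte (IPv4) or 16-byte (IPv6) arrays and rejects any other length, which is what a deserializer would hit if it read an address length from a message laid out differently than it expected.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper class, not part of Cassandra.
final class IllegalAddrLength {
    // Returns the exception message, or null if the bytes form a valid address.
    static String tryDecode(byte[] raw) {
        try {
            InetAddress.getByAddress(raw);
            return null;
        } catch (UnknownHostException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDecode(new byte[] {10, 8, 30, 14})); // 4 bytes: accepted as IPv4
        System.out.println(tryDecode(new byte[16]));               // 16 bytes: accepted as IPv6
        System.out.println(tryDecode(new byte[] {10, 8, 30}));     // 3 bytes: rejected
    }
}
```

The third call yields the same "addr is of illegal length" message seen in the log.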
        
        Brandon Williams added a comment -

        Here is a sad story of how multiple release cycles ended up causing a regression.

        The cause of these exceptions is CASSANDRA-4576. There, we added checks against VERSION_11 to prevent using the compatible mode with newer nodes that didn't need it. VERSION_11 has an actual value of 4. We closed the ticket on Sept 18, and that was that.

        Fast forward to November, where we closed CASSANDRA-4880. To do this, we needed a protocol version bump, and created VERSION_117, which has an actual value of 5. Unfortunately we used <= comparisons in CASSANDRA-4576, but now had created a version higher than VERSION_11 that still needed the compatibility, and we got our original bug back.

        The effect of this is if you upgrade from nodes on 1.1.7 or later to 1.2.0, the 1.2.0 nodes won't be able to gossip with the 1.1.7 nodes and they won't be visible in ring output on the 1.2.0 node until they too are on 1.2.0. The 1.1.7 nodes will still know about the 1.2.0 node, but they won't be able to successfully gossip with it, and keep it marked down.

        Patch attached to go ahead and compare more explicitly against VERSION_12 to fix this, but I think it highlights a deeper problem, which is that if we ever do need to do another protocol bump in a minor, stable branch, we're out of luck because there's no space between VERSION_117 and VERSION_12.
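
A minimal sketch of the regression described in this comment, not the attached patch. The values 4 and 5 come from the comment; VERSION_12 = 6 is an assumption inferred from the NET_VERSION:6 reported by the 1.2.0 nodes; the class and method names are hypothetical.

```java
// Hypothetical illustration, not the code from 5102.txt.
final class VersionCheckSketch {
    static final int VERSION_11 = 4;   // value stated in the comment
    static final int VERSION_117 = 5;  // value stated in the comment
    static final int VERSION_12 = 6;   // assumed from NET_VERSION:6 on 1.2.0 nodes

    // Compares against the newest version that needed compatibility at the time.
    // Once VERSION_117 exists, this wrongly reports that 1.1.7 peers need no compatibility.
    static boolean needsCompatOld(int peerVersion) {
        return peerVersion <= VERSION_11;
    }

    // Compares against the first version that does not need compatibility.
    static boolean needsCompatFixed(int peerVersion) {
        return peerVersion < VERSION_12;
    }

    public static void main(String[] args) {
        for (int v : new int[] {VERSION_11, VERSION_117, VERSION_12}) {
            System.out.println(v + ": old=" + needsCompatOld(v) + " fixed=" + needsCompatFixed(v));
        }
    }
}
```

The two checks disagree only for version 5, which is the 1.1.7-to-1.2.0 case reported here.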

        Sylvain Lebresne added a comment -

        > if we ever do need to do another protocol bump in a minor, stable branch, we're out of luck

        I'm sorry I did not follow CASSANDRA-4880 more closely, but hadn't we decided that we should not change the protocol version in a minor version because it breaks streaming and we were only fine doing that for major upgrades?

        Now I suppose what's done is done (though I wish some warning in the NEWS file for 1.1.7 had been added with CASSANDRA-4880 to explain that streaming would be broken during pre-1.1.7 to post-1.1.7 upgrades, and since it wasn't, we should probably document it now). But as long as a protocol bump means breaking streaming, I think we should maintain the rule "no bump in minor version" (not that I would be against lifting the "protocol bump == break streaming" limitation if possible, but that's a different discussion).

        Brandon Williams added a comment -

        > hadn't we decided that we should not change the protocol version in a minor version because it breaks streaming and we were only fine doing that for major upgrades?

        We had, but unfortunately it was the only way to fix what is hopefully the last of our schema problems in 1.1. The impact from CASSANDRA-4880 is much worse than having to upgrade all nodes before streaming.

        > I wish some warning in the NEWS file for 1.1.7 had been added with CASSANDRA-4880 to explain that streaming would be broken during pre-1.1.7 to post-1.1.7 upgrades

        Totally agree.

        Jonathan Ellis added a comment -

        The whole thing is a gotcha for assuming that we don't change messaging versions in minor releases. The futureproof version would have been if (!(version >= MessagingService.VERSION_12)) which looks kind of bizarre and might well have been "fixed" later on anyway.

        So I'd file this under "if we have to take a mulligan and add another version mid-release-cycle, make sure we validate usages of the initial major version."

        And if we want to be extra careful we should probably include an unused version before new major releases (i.e. make VERSION_20=8 instead of 7).
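
A rough sketch of the "unused version" idea, assuming VERSION_12 = 6 (inferred from the NET_VERSION:6 output earlier in the thread) and the suggested VERSION_20 = 8. VERSION_12_INTERIM and the class and method names are hypothetical; no such constant is mentioned in the ticket.

```java
// Hypothetical illustration of leaving a gap between major messaging versions.
final class VersionGapSketch {
    static final int VERSION_12 = 6;          // assumed from the gossipinfo output
    static final int VERSION_12_INTERIM = 7;  // hypothetical mid-cycle bump using the reserved slot
    static final int VERSION_20 = 8;          // suggested value (instead of 7)

    // A check written against the next major version keeps working
    // even if an interim version is later added below it.
    static boolean predates20(int peerVersion) {
        return peerVersion < VERSION_20;
    }

    public static void main(String[] args) {
        System.out.println(predates20(VERSION_12));
        System.out.println(predates20(VERSION_12_INTERIM));
        System.out.println(predates20(VERSION_20));
    }
}
```

With the gap in place, an emergency bump in the stable branch has a value to take without colliding with the next major version.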

        Jonathan Ellis added a comment -

        > if (!(version >= MessagingService.VERSION_12))

        Don't mind me, of course that simplifies to version < VERSION_12.

        +1 on the patch, and please update NEWS retroactively.

        Brandon Williams added a comment -

        Committed; also updated 1.1 news to mention 4880

        Jeremiah Jordan added a comment -

        Should that also get added to the 1.2.X NEWS.TXT in an UPGRADING section?


          People

          • Assignee:
            Brandon Williams
            Reporter:
            Michael Kjellman
            Reviewer:
            Jonathan Ellis
          • Votes:
            0
            Watchers:
            5
