CASSANDRA-6082

1.1.12 --> 1.2.x upgrade may result in an inconsistent ring

Details

    • Bug
    • Status: Resolved
    • Low
    • Resolution: Duplicate
    • 1.1.12 --> 1.2.9

    • Low

    Description

      This happened to me once, and since I don't have any more 1.1.x clusters I won't be testing again. I hope the attached files are enough for someone to connect the dots.

      I did a rolling restart to upgrade from 1.1.12 --> 1.2.9. About a week later I discovered that one node was in an inconsistent state in the ring. Depending on which node I ran nodetool status from, it was either:

      • up
      • host-id=null
      • missing

      I think I just missed this during the upgrade, but I cannot rule out the possibility that it "just happened for no reason" some time after the upgrade. It was detected when running repair on such a ring caused all sorts of terrible data "duplication" and performance tanked. Restarting the seeds + "bad" node made the ring consistent again.

      Two possibly suspicious things are an ArrayIndexOutOfBoundsException on startup:

      ERROR [GossipStage:1] 2013-09-06 10:45:35,213 CassandraDaemon.java (line 194) Exception in thread Thread[GossipStage:1,5,main]
      java.lang.ArrayIndexOutOfBoundsException: 2
              at org.apache.cassandra.service.StorageService.extractExpireTime(StorageService.java:1660)
              at org.apache.cassandra.service.StorageService.handleStateRemoving(StorageService.java:1607)
              at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1230)
              at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:1958)
              at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:841)
              at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:919)
              at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
              at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      

      and problems with hint delivery to multiple nodes:

      ERROR [MutationStage:11] 2013-09-06 13:59:19,604 CassandraDaemon.java (line 194) Exception in thread Thread[MutationStage:11,5,main]
      java.lang.AssertionError: Missing host ID for 10.20.2.45
              at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:583)
              at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:552)
              at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1658)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
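
      For anyone reading the first trace: an ArrayIndexOutOfBoundsException with index 2 is what Java throws when code reads the third element of an array that has fewer than three. The sketch below only illustrates that failure mode. The class, the comma-delimited value, and the field layout are all hypothetical and are not taken from the Cassandra source. Whether a gossip value with too few fields is what actually happened here is a guess on my part.

```java
public class ExpireTimeSketch {
    // Hypothetical parser that assumes the value always has at least three fields.
    static long extractExpireTime(String[] pieces) {
        return Long.parseLong(pieces[2]);
    }

    // Splits a made-up, comma-delimited state value and reports what was found.
    static String describe(String value) {
        String[] pieces = value.split(",", -1);
        try {
            return "expire time " + extractExpireTime(pieces);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Reached when the value carries fewer fields than the parser expects.
            return "only " + pieces.length + " fields, no expire time";
        }
    }

    public static void main(String[] args) {
        // Three fields: the parser finds an expire time.
        System.out.println(describe("removing,token,1378464335213"));
        // Two fields: indexing pieces[2] throws and is caught above.
        System.out.println(describe("removing,token"));
    }
}
```

      If that guess is right, a length check before indexing would turn the exception into a handled case, but I have not verified this against the real code.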
      

      Note, however, that while there were delivery problems to multiple nodes during the rolling upgrade, only one node was in a funky state a week later.

      Attached are the results of running gossipinfo and status on every node.
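
      In hindsight, comparing every node's view of the ring right after the upgrade would probably have caught this sooner. Below is a minimal sketch of that comparison. It assumes the per-node views have already been collected into a map (observer -> node -> state) and does not parse nodetool output. The node names and the "missing" placeholder are made up for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class RingViewCheck {
    // Returns every node whose reported state differs between observers.
    // A node absent from an observer's view is recorded as "missing".
    static Map<String, Set<String>> disagreements(Map<String, Map<String, String>> views) {
        // Collect every node mentioned by any observer.
        Set<String> allNodes = new TreeSet<>();
        for (Map<String, String> view : views.values()) {
            allNodes.addAll(view.keySet());
        }
        Map<String, Set<String>> result = new TreeMap<>();
        for (String node : allNodes) {
            Set<String> states = new TreeSet<>();
            for (Map<String, String> view : views.values()) {
                states.add(view.getOrDefault(node, "missing"));
            }
            // More than one distinct state means the observers disagree.
            if (states.size() > 1) {
                result.put(node, states);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical views: three observers, one node seen three different ways.
        Map<String, Map<String, String>> views = new TreeMap<>();
        views.put("node-a", Map.of("node-a", "up", "node-b", "up", "node-c", "up", "node-d", "up"));
        views.put("node-b", Map.of("node-a", "up", "node-b", "up", "node-c", "host-id=null", "node-d", "up"));
        views.put("node-d", Map.of("node-a", "up", "node-b", "up", "node-d", "up"));
        System.out.println(disagreements(views));
    }
}
```

      An empty result means all observers agree. Anything else lists the nodes worth a closer look before running repair.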

      Attachments

        1. c-gossipinfo
          118 kB
          Chris Burroughs
        2. c-status
          39 kB
          Chris Burroughs

            People

              Assignee: Unassigned
              Reporter: Chris Burroughs (cburroughs)
              Votes: 0
              Watchers: 4
