  1. Apache Cassandra
  2. CASSANDRA-12008

Make decommission operations resumable

Details

    • Docs

    Description

      We're dealing with large data sets (multiple terabytes per node), and sometimes we need to add or remove nodes. These operations depend heavily on the entire cluster being up, so while a new node is joining (which sometimes takes 6 hours or longer) a lot can go wrong, and in many cases something does.

      It would be great if failed streams could be retried.
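To make the request concrete, here is a minimal sketch of what retrying a failed stream could look like. It is not Cassandra code: `StreamFailed`, `retry_stream`, and the `transfer` callable are hypothetical names used only for illustration, and the exponential backoff policy is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class StreamFailed(Exception):
    """Hypothetical error standing in for a failed stream session."""


def retry_stream(transfer: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call transfer() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transfer()
        except StreamFailed:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Back off exponentially: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff can be exercised without actually waiting.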

      Example illustrating the problem:

      03:18 PM   ~ $ nodetool decommission
      error: Stream failed
      -- StackTrace --
      org.apache.cassandra.streaming.StreamException: Stream failed
              at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
              at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
              at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
              at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
              at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
              at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
              at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
              at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
              at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
              at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
              at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
              at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
              at java.lang.Thread.run(Thread.java:745)
      
      
      08:04 PM   ~ $ nodetool decommission
      nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal or restart
      See 'nodetool help' or 'nodetool help <command>'.
      

      Streaming failed, probably due to load:

      ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - [Stream #<streamid>] Streaming error occurred
      java.net.SocketTimeoutException: null
              at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_77]
              at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_77]
              at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_77]
              at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54) ~[apache-cassandra-3.0.6.jar:3.0.6]
              at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268) ~[apache-cassandra-3.0.6.jar:3.0.6]
              at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
      

      If implementing retries is not possible, can we have a 'nodetool decommission resume'?
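As a rough illustration of what a resume could mean, the sketch below records each range only after it has streamed successfully, so a second run skips finished work. This is an assumption about one possible design, not a description of Cassandra's implementation; `resume_decommission`, `TokenRange`, `transferred`, and `stream_range` are hypothetical names.

```python
from typing import Callable, Iterable, Set, Tuple

TokenRange = Tuple[int, int]  # hypothetical (start, end) token range


def resume_decommission(all_ranges: Iterable[TokenRange],
                        transferred: Set[TokenRange],
                        stream_range: Callable[[TokenRange], None]) -> None:
    """Stream every range not yet recorded in `transferred`.

    `transferred` is assumed to be persisted by the caller, so a later
    run can pick up where a failed run stopped.
    """
    for token_range in all_ranges:
        if token_range in transferred:
            continue  # already streamed by an earlier attempt
        stream_range(token_range)     # may raise; nothing is recorded then
        transferred.add(token_range)  # record only after success
```

Calling the function again with the same `transferred` set after a failure streams only the ranges that are still missing.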

            People

              kdmu Kaide Mu
              tvdw Tom van der Woerdt
              Kaide Mu
              Paulo Motta
              Votes: 0
              Watchers: 5
