Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-12008

Make decommission operations resumable

    XMLWordPrintableJSON

    Details

    • Impacts:
      Docs

      Description

      We're dealing with large data sets (multiple terabytes per node) and sometimes we need to add or remove nodes. These operations are very dependent on the entire cluster being up, so while we're joining a new node (which sometimes takes 6 hours or longer) a lot can go wrong and in a lot of cases something does.

      It would be great if the ability to retry streams was implemented.

      Example to illustrate the problem :

      03:18 PM   ~ $ nodetool decommission
      error: Stream failed
      -- StackTrace --
      org.apache.cassandra.streaming.StreamException: Stream failed
              at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
              at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
              at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
              at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
              at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
              at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
              at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
              at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
              at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
              at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
              at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
              at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
              at java.lang.Thread.run(Thread.java:745)
      
      
      08:04 PM   ~ $ nodetool decommission
      nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal or restart
      See 'nodetool help' or 'nodetool help <command>'.
      

      Streaming failed, probably due to load :

      ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - [Stream #<streamid>] Streaming error occurred
      java.net.SocketTimeoutException: null
              at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_77]
              at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_77]
              at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_77]
              at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54) ~[apache-cassandra-3.0.6.jar:3.0.6]
              at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268) ~[apache-cassandra-3.0.6.jar:3.0.6]
              at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
      

      If implementing retries is not possible, can we have a 'nodetool decommission resume'?

        Attachments

          Activity

            People

            • Assignee:
              kdmu Kaide Mu
              Reporter:
              tvdw Tom van der Woerdt
              Authors:
              Kaide Mu
              Reviewers:
              Paulo Motta
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: