Details
-
Improvement
-
Status: Resolved
-
Low
-
Resolution: Fixed
-
None
-
Docs
Description
We're dealing with large data sets (multiple terabytes per node) and sometimes we need to add or remove nodes. These operations are very dependent on the entire cluster being up, so while we're joining a new node (which sometimes takes 6 hours or longer) a lot can go wrong and in a lot of cases something does.
It would be great if the ability to retry streams was implemented.
Example to illustrate the problem :
03:18 PM ~ $ nodetool decommission error: Stream failed -- StackTrace -- org.apache.cassandra.streaming.StreamException: Stream failed at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310) at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457) at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210) at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186) at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430) at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622) at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486) at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274) at java.lang.Thread.run(Thread.java:745) 08:04 PM ~ $ nodetool decommission nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal or restart See 'nodetool help' or 'nodetool help <command>'.
Streaming failed, probably due to load :
ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - [Stream #<streamid>] Streaming error occurred java.net.SocketTimeoutException: null at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_77] at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_77] at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_77] at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54) ~[apache-cassandra-3.0.6.jar:3.0.6] at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268) ~[apache-cassandra-3.0.6.jar:3.0.6] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
If implementing retries is not possible, can we have a 'nodetool decommission resume'?
Attachments
Issue Links
- relates to
-
CASSANDRA-16290 Consistency can be violated when bootstrap or decommission is resumed after node restart
- Open