Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Fixed
-
None
-
None
-
Normal
Description
I'm not sure if anybody has tried using trunk and starting an actual EC2 cluster, but with a 3 node setup that works on 3.11, it fails to work on trunk. I mistakenly stumbled into it while testing my own code changes.
My setup is as follows:
broadcast_rpc_address: publicIP
broadcast_address: publicIP
listen_address: omitted. Ends up as privateIP.
Works on 3.11 just fine.
On trunk though, it never works. My node is never able to join the cluster:
Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO [main] 2018-04-03 05:57:05,895 RangeStreamer.java:195 - Bootstrap: range (-128373781239966537,-122439194129870521] exists on 52.88.241.181:7000 for keyspace system_traces Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO [main] 2018-04-03 05:57:05,895 RangeStreamer.java:195 - Bootstrap: range (6968670424536541270,6973888347502882935] exists on 52.88.241.181:7000 for keyspace system_traces Apr 03 05:57:42 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] 2018-04-03 05:57:42,298 FailureDetector.java:324 - Not marking nodes down due to local pause of 26215173446ns > 5000000000ns Apr 03 05:57:53 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] 2018-04-03 05:57:53,035 FailureDetector.java:324 - Not marking nodes down due to local pause of 10736485907ns > 5000000000ns Apr 03 05:58:30 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] 2018-04-03 05:58:30,790 Gossiper.java:814 - Gossip stage has 28 pending tasks; skipping status check (no nodes will be marked down) Apr 03 05:58:33 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] 2018-04-03 05:58:33,060 Gossiper.java:814 - Gossip stage has 20 pending tasks; skipping status check (no nodes will be marked down) Apr 03 06:04:33 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] 2018-04-03 06:04:33,826 FailureDetector.java:324 - Not marking nodes down due to local pause of 400790432954ns > 5000000000ns Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] 2018-04-03 06:04:49,133 Gossiper.java:814 - Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ERROR [StreamConnectionEstablisher:1] 2018-04-03 06:04:49,138 StreamSession.java:524 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6] Streaming error occurred on session with peer 52.88.241.181:7000 Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: java.io.IOException: failed to connect to 52.88.241.181:7000 (STREAM) for streaming data Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:98) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:57) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.createChannel(NettyStreamingMessageSender.java:183) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.setupControlMessageChannel(NettyStreamingMessageSender.java:165) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.sendMessage(NettyStreamingMessageSender.java:222) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.initialize(NettyStreamingMessageSender.java:146) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:271) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:273) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at java.lang.Thread.run(Thread.java:748) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Cannot assign requested address Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.unix.Errors.newIOException(Errors.java:117) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.unix.Socket.bind(Socket.java:266) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.epoll.AbstractEpollStreamChannel.doConnect(AbstractEpollStreamChannel.java:729) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.epoll.EpollSocketChannel.doConnect(EpollSocketChannel.java:184) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.connect(AbstractEpollStreamChannel.java:797) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1274) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:999) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:260) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:254) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:312) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ... 1 common frames omitted Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: INFO [StreamConnectionEstablisher:1] 2018-04-03 06:04:49,140 StreamResultFuture.java:197 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6] Session with 52.88.241.181:7000 is complete Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ERROR [StreamConnectionEstablisher:1] 2018-04-03 06:04:49,146 StreamSession.java:524 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6] Streaming error occurred on session with peer 52.88.241.181:7000 Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: java.lang.RuntimeException: stream has been closed, cannot send Prepare SYN (3 requests, 0 files} Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.sendMessage(NettyStreamingMessageSender.java:209) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.StreamSession.onInitializationComplete(StreamSession.java:495) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:272) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:273) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at java.lang.Thread.run(Thread.java:748)
Remote debugging it, it eventually seems to come to be due to trying to call socket.bind in doConnect in AbstractEpollStreamChannel. Local address is <publicIP>:0. I think port 0 is fine because it just means pick the next ephemeral port available, but there's no way of binding to the public IP.
I made a commit https://github.com/apache/cassandra/commit/00222ddf0694f1c35c2bdd84fd7407174c3fc57a that seems to have fixed it so I can continue on with my purposes, and that is have StreamSession use listen address instead of broadcast address which may be public. In the event listen address is public I guess that means if it was able to bind to that interface to begin with then socket.bind should also work.
Not sure if this is the right way though, and it seems to have been introduced by the Streaming rework to netty. https://issues.apache.org/jira/browse/CASSANDRA-12229.