Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
1.6.3, 2.2.1, 2.3.0
-
None
-
OS:- Centos 6.9(64 bit)
Description
We have a 3 node cluster. We were facing issues of multiple driver running in some scenario in production.
On further investigation we were able to reproduce iin both 1.6.3 and 2.2.1 versions the scenario with following steps:-
- Setup a 3 node cluster. Start master and slaves.
- On any node where the worker process is running block the connections on port 7077 using iptables.
iptables -A OUTPUT -p tcp --dport 7077 -j DROP
- After about 10-15 secs we get the error on node that it is unable to connect to master.
2018-01-23 12:08:51,639 [rpc-client-1-1] WARN org.apache.spark.network.server.TransportChannelHandler - Exception in connection from <servername> java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) at java.lang.Thread.run(Thread.java:745) 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect... 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect...
- Once we get this exception we renable the connections to port 7077 using
iptables -D OUTPUT -p tcp --dport 7077 -j DROP
- Worker tries to register again with master but is unable to do so. It gives following error
2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN org.apache.spark.deploy.worker.Worker - Failed to connect to master <servername>:7077 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100) at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108) at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to connect to <servername>:7077 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182) at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190) ... 4 more Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: <servername>:7077 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) ... 1 more 2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID 2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID
- The worker state is changed to DEAD in spark UI. As a result of which duplicate driver is launched.
Attachments
Issue Links
- fixes
-
SPARK-16190 Worker registration failed: Duplicate worker ID
- Resolved
- links to