[SPARK-23191] Worker registration fails in case of network drop - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.6.3, 2.2.1, 2.3.0
Fix Version/s: 3.0.0
Component/s: Spark Core
Labels: None
Environment:

OS: CentOS 6.9 (64-bit)

Description

We have a 3-node cluster. In production, we ran into scenarios where multiple drivers were running.

On further investigation, we were able to reproduce the problem in both 1.6.3 and 2.2.1 with the following steps:

Set up a 3-node cluster. Start the master and the slaves.
On any node where a worker process is running, block outgoing connections to port 7077 using iptables:
```
iptables -A OUTPUT -p tcp --dport 7077 -j DROP
```

After about 10-15 seconds, the worker on that node logs an error saying it is unable to connect to the master:

2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  org.apache.spark.network.server.TransportChannelHandler - Exception in connection from <servername>
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:745)
2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect...
2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect...

Once we get this exception, we re-enable connections to port 7077 using:
```
iptables -D OUTPUT -p tcp --dport 7077 -j DROP
```

The worker tries to register with the master again but is unable to do so. It logs the following error:

2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  org.apache.spark.deploy.worker.Worker - Failed to connect to master <servername>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
        at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to <servername>:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: <servername>:7077
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        ... 1 more
2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID
2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID

The worker's state changes to DEAD in the Spark UI. As a result, a duplicate driver is launched.
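One way to read the "Duplicate worker ID" message is that the master still has the worker's ID on record when the worker retries with the same ID. The report does not confirm this as the root cause. The Python sketch below is a toy model of that reading only, not Spark code: `ToyMaster`, its methods, the 60-second timeout, and the timestamps are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for a master's worker registry (hypothetical, not Spark code)."""

    worker_timeout_s: float = 60.0  # assumed timeout, for illustration only
    last_heartbeat: dict = field(default_factory=dict)  # worker_id -> last heartbeat time

    def register(self, worker_id: str, now: float) -> str:
        # The ID is still on record, so the request is rejected as a duplicate.
        if worker_id in self.last_heartbeat:
            return "Duplicate worker ID"
        self.last_heartbeat[worker_id] = now
        return "registered"

    def expire_dead_workers(self, now: float) -> list:
        # Forget workers whose last heartbeat is older than the timeout.
        dead = [w for w, t in self.last_heartbeat.items()
                if now - t > self.worker_timeout_s]
        for w in dead:
            del self.last_heartbeat[w]
        return dead


master = ToyMaster()
first = master.register("worker-1", now=0.0)

# Network drop: heartbeats stop after t=0. The worker notices the failure
# and retries with the same ID before the master has timed it out.
retry = master.register("worker-1", now=22.0)

# Only after the timeout does the toy master forget the old entry.
expired = master.expire_dead_workers(now=61.0)
late_retry = master.register("worker-1", now=62.0)
```

In this model the retry is rejected only because it arrives inside the timeout window. A retry made after the old entry has expired is accepted.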

Issue Links

fixes: SPARK-16190 Worker registration failed: Duplicate worker ID (Resolved)
links to: GitHub Pull Request #24569
