Spark / SPARK-23191

Worker registration fails in case of a network drop

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6.3, 2.2.1, 2.3.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None
    • Environment: OS: CentOS 6.9 (64-bit)

       

    Description

      We have a 3-node cluster. In production we ran into cases where multiple drivers were running at the same time.

      On further investigation, we were able to reproduce the scenario in both versions 1.6.3 and 2.2.1 with the following steps:

      1. Set up a 3-node cluster. Start the master and the slaves.
      2. On any node where the worker process is running, block outbound connections to port 7077 using iptables:
        iptables -A OUTPUT -p tcp --dport 7077 -j DROP
        
      3. After about 10-15 seconds, the worker on that node logs an error saying it is unable to connect to the master:
        2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  org.apache.spark.network.server.TransportChannelHandler - Exception in connection from <servername>
        java.io.IOException: Connection timed out
                at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
                at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
                at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
                at sun.nio.ch.IOUtil.read(IOUtil.java:192)
                at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
                at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
                at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
                at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
                at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
                at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
                at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
                at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
                at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
                at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
                at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
                at java.lang.Thread.run(Thread.java:745)
        2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect...
        2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect...
        
        
      4. Once we get this exception, we re-enable connections to port 7077 using:
        iptables -D OUTPUT -p tcp --dport 7077 -j DROP
        
      5. The worker tries to register with the master again but is unable to do so. It logs the following error:
      2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  org.apache.spark.deploy.worker.Worker - Failed to connect to master <servername>:7077
      org.apache.spark.SparkException: Exception thrown in awaitResult:
              at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
              at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
              at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
              at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
              at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Failed to connect to <servername>:7077
              at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
              at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
              at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
              at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
              at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
              ... 4 more
      Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: <servername>:7077
              at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
              at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
              at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
              at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
              at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
              at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
              at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
              ... 1 more
      2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID
      2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID
      6. The worker state changes to DEAD in the Spark UI. As a result, a duplicate driver is launched.
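
The block and unblock steps above can be scripted so the timing is repeatable. The following is only a minimal sketch, not part of the original report: the helper names `iptables_rule` and `simulate_network_drop` are hypothetical, and the 15-second block duration is an assumption based on the "about 10-15 seconds" observed above. The two iptables commands are the ones quoted in the steps. It defaults to a dry run that only returns the commands; actually applying the rules requires root on the worker node.

```python
import subprocess
import time
from typing import List

MASTER_PORT = 7077  # master port from the report


def iptables_rule(action: str, port: int = MASTER_PORT) -> List[str]:
    """Build the iptables command from the report.

    action is "-A" to add the DROP rule (block) or "-D" to delete it (unblock).
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return ["iptables", action, "OUTPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]


def simulate_network_drop(block_seconds: float = 15.0,
                          dry_run: bool = True) -> List[List[str]]:
    """Block outbound traffic to the master port, wait, then unblock.

    block_seconds is an assumed value; adjust it until the worker logs
    "Connection to master failed!" before the rule is removed.
    With dry_run=True nothing is executed and the commands are returned.
    """
    block = iptables_rule("-A")
    unblock = iptables_rule("-D")
    if dry_run:
        return [block, unblock]
    subprocess.run(block, check=True)
    try:
        time.sleep(block_seconds)
    finally:
        # Always remove the rule, even if the wait is interrupted.
        subprocess.run(unblock, check=True)
    return [block, unblock]


if __name__ == "__main__":
    # Dry run: print the commands that would be executed.
    for cmd in simulate_network_drop():
        print(" ".join(cmd))
```

After a real run (`dry_run=False`), the worker log on that node can be checked for the "Duplicate worker ID" message shown in step 5.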


            People

              Assignee: wuyi (Ngone51)
              Reporter: Neeraj Gupta (neeraj20gupta)