Details
- Bug
- Status: Resolved
- Major
- Resolution: Incomplete
- 2.2.0
- None
Description
Hi,
We recently came across the following corner case:
We are using dynamic allocation with an external shuffle service that is managed by Marathon.
Due to a network or operational issue, the external shuffle service on one of the machines (Mesos slaves) was unavailable for a few seconds (e.g. Marathon had not yet provisioned the external shuffle service on a particular node, but the framework had already accepted an offer on that node and tried to start an executor there).
This causes the framework (Spark driver) to fail, and I see an error in the driver's stderr (it seems the Mesos agent asks the driver to abort itself). However, the Spark context continues to run, seemingly in a kind of zombie mode: it can't release resources to the cluster and can't receive additional offers, because the framework is aborted from Mesos's perspective.
The framework moves to the "inactive" state in the Mesos UI.
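A rough sketch of the missing piece as I understand it, not actual Spark code: the linked SPARK-15359 talks about handling a DRIVER_ABORTED status returned from mesosDriver.run(). Something along these lines could stop the application instead of leaving it running after the framework is aborted. `AbortWatcher`, `runAndReact`, and the local `Status` enum are hypothetical stand-ins (the enum only mirrors the Mesos status names), and the `Supplier` stands in for the blocking mesosDriver.run() call.

```java
import java.util.function.Supplier;

/** Hypothetical sketch, not part of Spark or Mesos. */
final class AbortWatcher {

    /** Local stand-in that mirrors the Mesos driver status names. */
    enum Status { DRIVER_NOT_STARTED, DRIVER_RUNNING, DRIVER_ABORTED, DRIVER_STOPPED }

    /**
     * Blocks on runDriver (standing in for mesosDriver.run()) and invokes
     * onAbort if the driver ends in DRIVER_ABORTED. Returns the final status.
     */
    static Status runAndReact(Supplier<Status> runDriver, Runnable onAbort) {
        Status finalStatus = runDriver.get();
        if (finalStatus == Status.DRIVER_ABORTED) {
            // Assumption: onAbort would stop the Spark context and exit the process.
            onAbort.run();
        }
        return finalStatus;
    }

    public static void main(String[] args) {
        Status finalStatus = runAndReact(
                () -> Status.DRIVER_ABORTED,
                () -> System.out.println("framework aborted, stopping application"));
        System.out.println("final status: " + finalStatus);
    }
}
```

The point of the callback is only that an abort becomes visible to the application; what exactly should be shut down is an open question here.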
skonto susanxhuynh, any input on this problem? Have you come across such behavior?
I'm ready to work on a patch, but currently I don't understand where to start. It seems the driver is too fragile in this respect and something is missing in the Mesos-Spark integration.
I0412 07:31:25.827283 274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/10.106.14.61:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
    at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: my-company.com/10.106.14.61:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925 277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035 277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
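One possible mitigation for the "Connection refused" in the log, sketched under assumptions: probe the shuffle service port with a bounded retry and backoff before treating the node as failed. `ShuffleServiceProbe` and its methods are hypothetical helpers, not Spark or Mesos APIs. Only port 7337 comes from the log; the host, attempt count, and delays are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper, not part of Spark: waits for a TCP port to accept connections. */
final class ShuffleServiceProbe {

    /** Tries one TCP connect; returns true if the port accepted the connection. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // e.g. connection refused while the service is not up yet
        }
    }

    /**
     * Retries isReachable up to maxAttempts times, doubling the wait between attempts.
     * Returns the 1-based attempt that succeeded, or -1 if every attempt failed.
     */
    static int awaitReachable(String host, int port, int maxAttempts, long initialDelayMs)
            throws InterruptedException {
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isReachable(host, port, 1000)) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        // 7337 is the shuffle service port seen in the log; the host is a placeholder.
        int attempt = awaitReachable("localhost", 7337, 3, 100);
        System.out.println(attempt > 0 ? "reachable on attempt " + attempt : "still unreachable");
    }
}
```

Whether a retry belongs in the backend, or whether the offer should simply be declined until the service is up, is a design choice this sketch does not settle.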
Issue Links
- relates to SPARK-15359 Mesos dispatcher should handle DRIVER_ABORTED status from mesosDriver.run() (Resolved)