Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Duplicate
Description
When an application has many executors (for example, 1000), connection timeouts often occur. The exception is:
WARN nio.SendingConnection: Error finishing connection
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.spark.network.nio.SendingConnection.finishConnect(Connection.scala:342)
at org.apache.spark.network.nio.ConnectionManager$$anon$11.run(ConnectionManager.scala:273)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
This causes the driver to treat these executors as lost, although they are in fact still alive. Adding a retry mechanism would reduce the probability of this problem occurring.
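As a rough illustration of the kind of retry being proposed, the sketch below retries a connection attempt a bounded number of times when it fails with java.net.ConnectException. It is not Spark's code and not the change made under SPARK-4188; the class name ConnectRetry, the method withRetry, and the parameters maxAttempts and waitMillis are all hypothetical names chosen for this example.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

public class ConnectRetry {
    /**
     * Hypothetical helper (not part of Spark): runs `attempt` up to
     * `maxAttempts` times, sleeping `waitMillis` between tries, and
     * retries only when the failure is a ConnectException.
     */
    public static <T> T withRetry(Callable<T> attempt, int maxAttempts, long waitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConnectException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (ConnectException e) {
                // Remember the failure; give up only after the final attempt.
                last = e;
                if (i < maxAttempts) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the connect step in the Callable, for example withRetry(() -> openChannel(host, port), 3, 5000L), where openChannel stands in for whatever actually opens the connection. Only after the last attempt fails would the remote executor be reported as unreachable.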
Issue Links
- duplicates SPARK-4188 Shuffle fetches should be retried at a lower level (Resolved)