Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- 4.0.0
Description
In K8s cluster deploy mode, all jars, including the primary resource jar and any jars passed via --jars or spark.jars, are first downloaded to the driver's local disk and then served to executors through the file server running on the driver.
When the jars are large and the application requests many executors, the resulting burst of concurrent jar downloads from the driver saturates the network. The executors' jar downloads then time out, and those executors are terminated. From the user's point of view, the application is stuck in a loop of mass executor loss and re-provisioning and never reaches the requested number of live executors, which leads to job SLA breaches and sometimes to job failure.
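
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch. It assumes that every executor fetches the full set of jars from the driver at the same time and that the driver's outbound bandwidth is shared evenly. The function name, the jar size, the executor count, the bandwidth, and the timeout value are all hypothetical numbers chosen for illustration; none of them come from this issue.

```python
def estimated_fetch_seconds(jar_bytes, num_executors, driver_bytes_per_sec):
    """Estimate how long it takes until every executor has the jars.

    Simplifying assumption: all executors download concurrently and split
    the driver's outbound bandwidth evenly, so they all finish together
    once the driver has sent jar_bytes * num_executors bytes in total.
    """
    total_bytes = jar_bytes * num_executors
    return total_bytes / driver_bytes_per_sec


# Hypothetical workload: 500 MB of jars, 1000 executors,
# and a driver with a 10 Gbit/s link (1.25e9 bytes per second).
seconds = estimated_fetch_seconds(500e6, 1000, 1.25e9)

# Hypothetical fetch timeout, used only for comparison.
assumed_timeout_seconds = 120

print(f"estimated fetch time: {seconds:.0f} s")
print(f"exceeds assumed timeout: {seconds > assumed_timeout_seconds}")
```

Under these assumptions the driver has to push 500 GB through a single link, so the estimated fetch time is several times the assumed timeout. Relaunched executors then repeat the same downloads, which matches the loss and re-provision loop described above.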