SPARK-24794

DriverWrapper should have both master addresses in -Dspark.master


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 3.0.0
    • Component/s: Deploy
    • Labels:
      None

      Description

      In standalone cluster mode, a driver can be launched with supervise mode enabled. Spark launches the driver with the JVM argument -Dspark.master, which is set to the host and port of the current master.

      During the life of the context, the Spark masters can switch for any reason. If the driver then dies unexpectedly and is restarted, it tries to connect to the master that was set initially via -Dspark.master, but that master is now in STANDBY mode. The context tries several times to connect to the standby and then kills itself.
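A comma-separated master URL is what makes failover survivable: the client can try each listed master in turn instead of retrying a single stale address. A minimal sketch of splitting such a URL into candidate masters (the helper name is hypothetical; the `spark://host1:port1,host2:port2` format follows Spark's documented multi-master convention):

```scala
// Sketch: split a multi-master URL into the individual master endpoints a
// client could attempt in order. Not actual Spark internals.
object MasterUrlSketch {
  def candidateMasters(masterUrl: String): Seq[String] =
    masterUrl
      .stripPrefix("spark://")       // drop the scheme once
      .split(",")                    // one entry per host:port pair
      .toSeq
      .map(hostPort => s"spark://$hostPort")
}
```

With a single-master `-Dspark.master` value, this sequence has exactly one entry, so a restarted driver has no alternative master to fall back to.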

       

      Suggestion:

      When launching the driver process, the Spark master should use the spark.master URL passed as input (which may list all masters) instead of the host and port of the current master.
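The suggestion above can be sketched as follows. All names here are hypothetical illustrations, not Spark internals, and the second master address is an assumption; the point is only the difference between substituting the active master's address and forwarding the URL as submitted:

```scala
// Sketch of the suggested fix: when building the restarted driver's JVM
// options, forward the originally submitted spark.master URL rather than
// the current active master's host:port.
object DriverCommandSketch {
  // Multi-master URL the application was submitted with (second address assumed).
  val submittedMasterUrl = "spark://10.100.100.22:7077,10.100.100.23:7077"

  // Current (buggy) behavior: only the active master's address survives,
  // so after failover the driver keeps retrying a STANDBY master.
  def buggyMasterOption(activeMasterHostPort: String): String =
    s"-Dspark.master=spark://$activeMasterHostPort"

  // Suggested behavior: pass the URL through unchanged, so a restarted
  // driver can still reach whichever master is ALIVE.
  def fixedMasterOption(submittedUrl: String): String =
    s"-Dspark.master=$submittedUrl"
}
```

Under this sketch, `buggyMasterOption("10.100.100.22:7077")` pins the driver to one master, while `fixedMasterOption` keeps every submitted address available.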

      Log messages that we observe:

       

      2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0 org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: Connecting to master spark://10.100.100.22:7077..
      .....
      2018-07-11 13:03:21,806 INFO netty-rpc-connection-0 org.apache.spark.network.client.TransportClientFactory []: Successfully created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in bootstraps)
      .....
      2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0 org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: Connecting to master spark://10.100.100.22:7077...
      .....
      2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0 org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: Connecting to master spark://10.100.100.22:7077...
      .....
      2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application has been killed. Reason: All masters are unresponsive! Giving up.


            People

            • Assignee:
              bsikander Behroz Sikander
            • Reporter:
              bsikander Behroz Sikander
            • Votes:
              0
            • Watchers:
              3
