  1. Oozie
  2. OOZIE-3286

[spark-action] Launch Spark application from Oozie server

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.0.0
    • Fix Version/s: None
    • Component/s: action, core
    • Labels:
      None

      Description

      This is a major refactor / rework of the Spark action.

      Today Oozie Spark actions run as follows:

      1. SparkActionExecutor, which extends JavaActionExecutor and runs as part of the Oozie server code, launches SparkMain as a YARN application in the usual way, using LauncherAM
      2. SparkMain runs in a YARN container on a NodeManager host and calls SparkSubmit from the same JVM
      3. SparkSubmit fires up a Spark driver:
        • in local or yarn-client mode, the driver runs in the same JVM as SparkMain, inside the YARN container. In yarn-client mode Spark's ApplicationMaster runs in a separate YARN container and is only used to request resources
        • in yarn-cluster mode, the Spark driver runs inside Spark's ApplicationMaster, in a different JVM than SparkMain, maybe on a different host. SparkMain can go away after the YARN application has been submitted

      Problems with this approach are:

      • too many levels of indirection add latency, and every extra hop is another place where communication errors can occur
      • since SparkSubmit is launched from the same JVM where SparkMain runs, the Spark application will share all the environment variables, classpath, sharelib dependencies etc. with Oozie's Spark launcher code, causing hard-to-nail-down environment and classpath issues
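The second problem can be illustrated with a minimal sketch. The class and property names below are hypothetical and stand in for SparkMain (the launcher) and SparkSubmit (the application); no Oozie or Spark code is involved. The point is only that code invoked in the same JVM sees whatever JVM-wide state the launcher has set up.

```java
/**
 * Hypothetical demo, not Oozie or Spark code: shows that code called
 * in-process shares JVM-wide state (here, a system property) with its caller.
 */
final class SharedJvmDemo {
    private SharedJvmDemo() {}

    /** Stand-in for application code (the role SparkSubmit plays above). */
    static String applicationView() {
        // The application never sets this property itself.
        return System.getProperty("demo.launcher.setting", "<unset>");
    }

    /** Stand-in for launcher code (the role SparkMain plays above). */
    static String launchInProcess() {
        // The launcher configures something for its own use...
        System.setProperty("demo.launcher.setting", "launcher-only-value");
        // ...and then calls the application in the same JVM, which sees it too.
        return applicationView();
    }

    public static void main(String[] args) {
        System.out.println("application sees: " + launchInProcess());
    }
}
```

The same sharing applies to the classpath and environment variables, which is why a separate JVM for the Spark application avoids this class of problem.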

      The future of Oozie Spark action launching looks like this:

      1. SparkActionExecutor (running as part of Oozie server code) launches the Spark application through InProcessLauncher (available since Spark 2.3), always in yarn-cluster mode
      2. InProcessLauncher calls either InProcessSparkSubmit or SparkSubmit. We translate any Spark application mode to yarn-cluster, so that the Spark driver runs in a JVM different from the one that submits it (the Oozie server)
      3. this allows for much better resource usage, since one fewer YARN ApplicationMaster (Oozie's LauncherAM) is needed
      4. since the Spark driver and executors are always launched in JVMs separate from Oozie's, Oozie's environment variables and classpath no longer interfere with the driver or executor classpath
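The mode translation mentioned in step 2 could look roughly like the sketch below. This is an illustration only, not the implementation proposed for Oozie: the class `SparkModeTranslator` and its method are hypothetical, and the sketch assumes it is given only spark-submit style options (not the application's own arguments). It drops any user-supplied `--master` or `--deploy-mode` and forces `--master yarn --deploy-mode cluster`.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper, not part of Oozie or Spark: rewrites spark-submit
 * style options so the application always runs in yarn-cluster mode.
 */
final class SparkModeTranslator {
    private SparkModeTranslator() {}

    /**
     * Returns a new option list that starts with the forced master and deploy
     * mode, followed by all other options in their original order.
     * Assumption: the input holds Spark options only, no application arguments.
     */
    static List<String> toYarnCluster(List<String> sparkOpts) {
        List<String> out = new ArrayList<>();
        out.add("--master");
        out.add("yarn");
        out.add("--deploy-mode");
        out.add("cluster");
        for (int i = 0; i < sparkOpts.size(); i++) {
            String opt = sparkOpts.get(i);
            if (opt.equals("--master") || opt.equals("--deploy-mode")) {
                i++; // also skip the value that follows the option
                continue;
            }
            if (opt.startsWith("--master=") || opt.startsWith("--deploy-mode=")) {
                continue; // "--option=value" spelling, value is part of the token
            }
            out.add(opt);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> userOpts = List.of(
                "--master", "local[*]",
                "--conf", "spark.executor.memory=2g");
        System.out.println(toYarnCluster(userOpts));
    }
}
```

The resulting list could then be handed to whatever launcher API is used; how that wiring is done in Oozie is outside the scope of this sketch.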

              People

              • Assignee:
                andras.piros Andras Piros
                Reporter:
                andras.piros Andras Piros
              • Votes:
                0
                Watchers:
                2
