Spark / SPARK-10789

spark-submit in cluster mode can't use third-party libraries

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.5.0, 1.6.0
    • Fix Version/s: None
    • Component/s: Spark Submit
    • Labels:
      None

      Description

      When using cluster deploy mode, the classpath of the SparkSubmit process that gets launched only includes the Spark assembly and not spark.driver.extraClassPath. This is of course by design, since the driver actually runs on the cluster and not inside the SparkSubmit process.

      However, if the SparkSubmit process, minimal as it may be, needs any extra libraries that are not part of the Spark assembly, there is no good way to include them. (I say "no good way" because including them in the SPARK_CLASSPATH environment variable does cause the SparkSubmit process to include them, but this is not acceptable because this environment variable has long been deprecated, and it prevents the use of spark.driver.extraClassPath.)
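      For illustration, the two approaches contrasted above look roughly like this (the EMRFS path is hypothetical; this is a sketch, not EMR's actual configuration):

```shell
# Deprecated: SPARK_CLASSPATH does put the jars on the classpath of the
# SparkSubmit process itself, but it is deprecated and conflicts with
# spark.driver.extraClassPath. (Hypothetical path.)
export SPARK_CLASSPATH='/usr/share/aws/emr/emrfs/lib/*'

# Preferred: but in cluster deploy mode this only affects the driver
# running on the cluster, not the local SparkSubmit process.
spark-submit \
  --deploy-mode cluster \
  --conf 'spark.driver.extraClassPath=/usr/share/aws/emr/emrfs/lib/*' \
  ...
```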

      An example of when this matters is on Amazon EMR when using an S3 path for the application JAR and running in yarn-cluster mode. The SparkSubmit process needs the EmrFileSystem implementation and its dependencies in the classpath in order to download the application JAR from S3, so it fails with a ClassNotFoundException. (EMR currently gets around this by setting SPARK_CLASSPATH, but as mentioned above this is less than ideal.)

      I have tried modifying SparkSubmitCommandBuilder to include the driver extra classpath whether it's client mode or cluster mode, and this seems to work, but I don't know if there is any downside to this.

      Example that fails on emr-4.0.0 (if you switch to setting spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt

      Resulting Exception:
      Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
      at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
      at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
      at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
      at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
      at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
      at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
      at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
      at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
      at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
      at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
      at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
      at scala.collection.immutable.List.foreach(List.scala:318)
      at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
      at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
      at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
      at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
      at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
      at org.apache.spark.deploy.yarn.Client.main(Client.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
      at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
      at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
      at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
      at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
      ... 27 more

      1. SPARK-10789.v1.6.0.diff
        0.9 kB
        Jonathan Kelly
      2. SPARK-10789.diff
        0.9 kB
        Jonathan Kelly

        Activity

        vanzin Marcelo Vanzin added a comment -

        I'm going to close this as won't fix for a couple of reasons:

        • With Spark 2 it should be much easier to add these things to the spark-submit classpath (just drop new files in the jars/ directory); Spark 2.3 also has a new build profile that helps with creating a distribution with all the extra Hadoop FileSystem implementations.
        • The fix for SPARK-20059 includes some changes that make jars added with --jars visible to the cluster launcher too, although I'm not sure whether that would work for FileSystem implementations.
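        As a sketch of the first point: in a Spark 2.x distribution, any jar placed in the jars/ directory is visible to the SparkSubmit process as well as to drivers and executors. (The jar name below is hypothetical.)

```shell
# Hypothetical jar name; everything in $SPARK_HOME/jars/ is on the
# classpath of spark-submit itself, so FileSystem implementations
# placed here are available when downloading the application jar.
cp emrfs-hadoop-assembly.jar "$SPARK_HOME/jars/"
```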

        I don't think an extra config option here makes sense, given how easy it is to add this stuff now.

        jonathak Jonathan Kelly added a comment -

        Thanks, Sean Owen, that makes sense and was something else I was considering as well. I suppose it doesn't necessarily make sense to add a new Spark property when this is really more of a Hadoop-related issue than a Spark-related one (in that the extra library needed is for a custom Hadoop FileSystem).

        One downside, though, is that the EMRFS libraries would then be duplicated inside the Spark assembly, making it even more massive than it already is; but it already contains so much duplication (of other jars already present elsewhere on the cluster) that this doesn't seem so bad. Also, I think this would just require some changes to the pom.xml rather than a code patch, so that's nice. Lastly, I saw that removing the need for a Spark assembly is under consideration for Spark 2.x, so hopefully any downside to adding more libraries to the assembly now will be mitigated once the assembly is no longer necessary, because then we could just make sure that the EMRFS libraries are in whatever list of jars is included in the Spark classpath.

        And yes, the title of this JIRA issue has been bugging me too. I'm not sure why I gave it such a specific title without referencing the actual problem. I'll fix it.

        jonathak Jonathan Kelly added a comment - - edited

        Yes, using this patch requires a rebuild. If you are using Spark on YARN, the Spark assembly should only need to be on the master node, but yes, you'd need to distribute the new assembly across your cluster if you are using Spark Standalone.

        Also, yes, spark.{driver,executor}.extra{ClassPath,LibraryPath} lets you distribute extra .jar and .so files without a rebuild, but the point of this JIRA issue is that spark.driver.extraClassPath takes effect with client deploy-mode but not cluster deploy-mode. This means that if you need any extra jars for accessing a custom Hadoop FileSystem to get the application jar (e.g., EMRFS), they'll either need to be included in the Spark assembly jar, or you'll need this patch.

        srowen Sean Owen added a comment -

        I don't think we want to build another config flag in here. It sounds like you want to build an assembly that's appropriate for your version of Hadoop, one that includes access to custom file systems. The general practice here has been to do just that, so you could include S3 FS libs as desired. It's probably good practice to make your own build anyway if it needs to harmonize with EMR's version of Hadoop.
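        A build along those lines might look something like the following sketch. The profile and Hadoop version shown are illustrative assumptions, not a prescription for EMR:

```shell
# Sketch: build a Spark 1.x distribution against a chosen Hadoop version,
# so the assembly can bundle the extra FileSystem jars you need.
# (Profile names and version are assumptions; adjust to your build.)
./make-distribution.sh --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0
```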

        (BTW you could update the title to reference the problem more directly: spark-submit in cluster mode can't use third-party libraries or something. Including the assembly isn't a problem.)

        roireshef Roi Reshef added a comment - - edited

        Thanks Jonathan Kelly. That requires rebuilding Spark and redistributing it across my cluster, right? I finally figured out a solution to import external jars without rebuilding Spark. One can modify two configurations inside spark-env.sh (at least for the Netlib package, which includes *.jar and *.so files):

        spark.{driver,executor}.extraClassPath - for *.jar
        spark.{driver,executor}.extraLibraryPath - for *.so

        And Spark (I'm using v1.5.2) will pick them up automatically.
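        Concretely, the equivalent entries in spark-defaults.conf would look roughly like this (the /opt/netlib paths are hypothetical placeholders):

```shell
# spark-defaults.conf sketch: extra jars and native libraries for the
# driver and executors, without rebuilding the Spark assembly.
spark.driver.extraClassPath      /opt/netlib/lib/*
spark.executor.extraClassPath    /opt/netlib/lib/*
spark.driver.extraLibraryPath    /opt/netlib/native
spark.executor.extraLibraryPath  /opt/netlib/native
```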

        jonathak Jonathan Kelly added a comment -

        An alternative to this change could be to have another setting that allows you to configure the SparkSubmit classpath separately from the driver/executor classpaths. That way we wouldn't necessarily need to set the SparkSubmit classpath to include all of the libraries set in the driver classpath, which is the behavior this patch currently causes.

        Does anybody in the community have any thoughts/opinions on either of these approaches?

        jonathak Jonathan Kelly added a comment -

        Here's another patch that can be applied to v1.6.0: https://issues.apache.org/jira/secure/attachment/12779704/SPARK-10789.v1.6.0.diff
        jonathak Jonathan Kelly added a comment -

        Roi Reshef, sure, here's a patch for my workaround: https://issues.apache.org/jira/secure/attachment/12778871/SPARK-10789.diff
        roireshef Roi Reshef added a comment -

        Any resolution on that? Can you elaborate more on how you were able to bypass the limitations you described? I'm trying to add Netlib, unsuccessfully. I'm also restricted to running the driver in yarn-client mode rather than yarn-cluster - will your temporary solution work with client (local) mode?


          People

          • Assignee:
            Unassigned
          • Reporter:
            jonathak Jonathan Kelly
          • Votes:
            1
          • Watchers:
            7

            Dates

            • Created:
              Updated:
              Resolved:
