Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.0.0, 3.1.3, 3.3.0, 3.2.2, 3.4.0
- Labels: None
- Environment: Spark 3.1.3, Kubernetes 1.21, Ubuntu 20.04.1
Description
I discovered that remote URIs in spark.jars get discarded when launching Spark on Kubernetes in cluster mode via spark-submit.
Reproduction
Here is an example reproduction with S3 being used for remote JAR storage:
I first created two JARs:
- /opt/my-local-jar.jar on the host where I'm running spark-submit
- s3://$BUCKET_NAME/my-remote-jar.jar in an S3 bucket I own
I then ran the following spark-submit command with spark.jars pointing to both the local JAR and the remote JAR:
spark-submit \
  --master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
  --deploy-mode cluster \
  --name=spark-submit-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
  --conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
  [...] \
  /opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
Once the driver and the executors started, I confirmed that there was no trace of my-remote-jar.jar anymore. For example, looking at the Spark History Server, I could see that spark.jars had been rewritten to contain only the copy of my-local-jar.jar that spark-submit uploaded to spark.kubernetes.file.upload.path.
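For illustration only (the upload directory name is generated at submit time, so the exact path below is hypothetical), the rewritten property looked roughly like this, with the remote JAR gone:

spark.jars=s3a://$BUCKET_NAME/my-upload-path/spark-upload-<generated-id>/my-local-jar.jar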
There was no mention of my-remote-jar.jar on the classpath or anywhere else.
Note that I ran all my tests with Spark 3.1.3; however, the code that handles these dependencies appears to be the same in more recent versions of Spark as well.
Root cause description
The issue appears to come from this logic in BasicDriverFeatureStep.getAdditionalPodSystemProperties().
Specifically, this logic takes all URIs in spark.jars, keeps only the local ones, uploads those local files to spark.kubernetes.file.upload.path, and then replaces the value of spark.jars with the newly uploaded JARs. By overwriting the previous value of spark.jars, we lose all mention of the remote JARs that were previously specified there.
Consequently, when the Spark driver starts afterwards, it only downloads JARs from spark.kubernetes.file.upload.path.
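For context, here is a simplified sketch of that logic as it appears in Spark 3.1.x (paraphrased from BasicDriverFeatureStep; the exact code may differ slightly across versions), with the problematic overwrite marked:

Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  // Keep only the local, resolvable URIs; remote URIs are dropped from this list
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  val value = if (key == ARCHIVES) {
    uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
  } else {
    uris
  }
  val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    val resolvedValue = if (key == ARCHIVES) {
      uris.zip(resolved).map { case (uri, r) =>
        UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
      }
    } else {
      resolved
    }
    // The overwrite happens here: only the uploaded local files end up in the property
    additionalProps.put(key.key, resolvedValue.mkString(","))
  }
}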
Possible solution
I think a possible fix would be to not overwrite the value of spark.jars entirely, but to make sure the remote URIs are kept in it.
The new logic would look something like this:
Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  // Save remote URIs
  val remoteUris = conf.get(key).filter(uri => !KubernetesUtils.isLocalAndResolvable(uri))
  val value = {
    if (key == ARCHIVES) {
      uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
    } else {
      uris
    }
  }
  val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    val resolvedValue = if (key == ARCHIVES) {
      uris.zip(resolved).map { case (uri, r) =>
        UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
      }
    } else {
      resolved
    }
    // don't forget to add remote URIs
    additionalProps.put(key.key, (resolvedValue ++ remoteUris).mkString(","))
  }
}
I tested it out in my environment and it worked: s3a://$BUCKET_NAME/my-remote-jar.jar was kept in spark.jars and the Spark driver was able to download it.
However, I don't know the codebase well enough to assess whether I am missing something else or whether this is enough to fix the issue.