XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0, 3.1.3, 3.3.0, 3.2.2, 3.4.0
Fix Version/s: 3.2.4, 3.3.2, 3.4.0
Component/s: Kubernetes, Spark Submit
Labels:
None
Environment:

Spark 3.1.3

Kubernetes 1.21

Ubuntu 20.04.1

Language:
- scala

Description

I discovered that remote URIs in spark.jars get discarded when launching Spark on Kubernetes in cluster mode via spark-submit.

Reproduction

Here is an example reproduction with S3 being used for remote JAR storage:

I first created 2 JARs:

/opt/my-local-jar.jar on the host where I'm running spark-submit
s3://$BUCKET_NAME/my-remote-jar.jar in an S3 bucket I own

I then ran the following spark-submit command with spark.jars pointing to both the local JAR and the remote JAR:

 spark-submit \
  --master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
  --deploy-mode cluster \
  --name=spark-submit-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
  --conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
  [...]
  /opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar

Once the driver and the executors started, I confirmed that there was no trace of my-remote-jar.jar anymore. For example, looking at the Spark History Server, I could see that spark.jars got transformed into this:

There was no mention of my-remote-jar.jar on the classpath or anywhere else.

Note that I ran all tests with Spark 3.1.3, however the code which handles those dependencies seems to be the same for more recent versions of Spark as well.

Root cause description

I believe that the issue seems to be coming from this logic in BasicDriverFeatureStep.getAdditionalPodSystemProperties().

Specifically, this logic takes all URIs in spark.jars, filters only on local URIs, uploads those local files to spark.kubernetes.file.upload.path }}and then replaces the value of {{spark.jars with those newly uploaded JARs. By overwriting the previous value of spark.jars, we are losing all mention of remote JARs that were previously specified there.

Consequently, when the Spark driver starts afterwards, it only downloads JARs from spark.kubernetes.file.upload.path.

Possible solution

I think a possible fix would be to not fully overwrite the value of spark.jars but to make sure that we keep remote URIs there.

The new logic would look something like this:

Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  // Save remote URIs
  val remoteUris = conf.get(key).filter(uri => !KubernetesUtils.isLocalAndResolvable(uri))
  val value = {
    if (key == ARCHIVES) {
      uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
    } else {
      uris
    }
  }
  val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    val resolvedValue = if (key == ARCHIVES) {
      uris.zip(resolved).map { case (uri, r) =>
        UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
      }
    } else {
      resolved
    }
    // don't forget to add remote URIs
    additionalProps.put(key.key, (resolvedValue ++ remoteUris).mkString(","))
  }
}

I tested it out in my environment and it worked: s3a://$BUCKET_NAME/my-remote-jar.jar was kept in spark.jars and the Spark driver was able to download it.

I don't know the codebase well enough though to assess whether I am missing something else or if this is enough to fix the issue.