  1. Spark
  2. SPARK-33425

Credentials are not passed in the `doFetchFile` when running spark-submit with https url



    • Type: Bug
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels:


      I'm running spark-submit with an https URL that contains a username and password. According to the documentation (https://spark.apache.org/docs/latest/submitting-applications.html), this is supported:

      (Note that credentials for password-protected repositories can be supplied in some cases in the repository URI, such as in https://user:password@host/.... Be careful when supplying credentials this way.)

      However, when using that, I receive the following error:


      INFO - 20/11/11 12:59:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      INFO - Exception in thread "main" java.io.IOException: Server returned HTTP response code: 401 for URL: https://username:*****@host.com/my_app/pipeline.jar
      INFO - at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1924)
      INFO - at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
      INFO - at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
      INFO - at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:729)
      INFO - at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:138)
      INFO - at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
      INFO - at scala.Option.map(Option.scala:230)
      INFO - at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
      INFO - at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
      INFO - at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
      INFO - at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
      INFO - at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
      INFO - at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
      INFO - at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
      INFO - at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


      When downloading my file manually using wget, at first I receive a 401 error but then there's a retry with credentials:

      HTTP request sent, awaiting response... 401 Unauthorized
      Authentication selected: Basic realm="Restricted"
      Reusing existing connection to host.com:443.
      HTTP request sent, awaiting response... 200 OK

      When I use `--auth-no-challenge` with wget, the credentials are sent in the first request and I receive 200 OK right away. Without that option, wget first tries to download the file without credentials and sends them only after the 401 challenge, so the download takes two requests. My issue looks similar: the credentials are not passed in the first request.
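To make the difference concrete, here is a minimal sketch of what sending the credentials in the first request could look like with `java.net.HttpURLConnection`, the class that appears in the stack trace. It is not Spark's code and not a proposed patch: the class `PreemptiveBasicAuth` and its methods `basicAuthHeader` and `open` are hypothetical names for illustration, and it assumes the server accepts HTTP Basic authentication, as the wget output suggests.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Spark.
class PreemptiveBasicAuth {

    // Returns the "Authorization" header value built from the user-info part
    // of the URL ("user:password"), or empty if the URL carries no credentials.
    static Optional<String> basicAuthHeader(String url) {
        String userInfo = URI.create(url).getUserInfo(); // percent-decoded, may be null
        if (userInfo == null || userInfo.isEmpty()) {
            return Optional.empty();
        }
        String encoded = Base64.getEncoder()
                .encodeToString(userInfo.getBytes(StandardCharsets.UTF_8));
        return Optional.of("Basic " + encoded);
    }

    // Prepares (but does not yet send) a request that carries the credentials
    // up front, instead of waiting for a 401 challenge from the server.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        basicAuthHeader(url).ifPresent(h -> conn.setRequestProperty("Authorization", h));
        return conn;
    }

    public static void main(String[] args) {
        // Placeholder URL in the same shape as the one in the report.
        String url = "https://user:password@host.com/my_app/pipeline.jar";
        System.out.println("header present: " + basicAuthHeader(url).isPresent());
    }
}
```

Calling `open(url).getInputStream()` would then send the `Authorization` header with the first request, which corresponds to what `wget --auth-no-challenge` does. Whether this is the right fix for `doFetchFile` is a separate question; the sketch only shows the mechanism.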




            • Assignee:
              pprzetacznik Piotr Przetacznik
            • Votes: 0
            • Watchers: 2


              • Created: