SPARK-33425

Credentials are not passed in the `doFetchFile` when running spark-submit with https url


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      I'm running spark-submit with an https URL that contains a username and password. The documentation (https://spark.apache.org/docs/latest/submitting-applications.html) says:

      (Note that credentials for password-protected repositories can be supplied in some cases in the repository URI, such as in https://user:password@host/.... Be careful when supplying credentials this way.)

      However, when using that, I receive the following error:

       

      
      INFO - 20/11/11 12:59:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      INFO - Exception in thread "main" java.io.IOException: Server returned HTTP response code: 401 for URL: https://username:*****@host.com/my_app/pipeline.jar
      INFO - at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1924)
      INFO - at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
      INFO - at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
      INFO - at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:729)
      INFO - at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:138)
      INFO - at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
      INFO - at scala.Option.map(Option.scala:230)
      INFO - at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
      INFO - at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
      INFO - at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
      INFO - at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
      INFO - at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
      INFO - at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
      INFO - at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
      INFO - at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
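      My understanding of the 401 (an assumption, based on stock JDK behavior rather than anything in the Spark sources): `java.net.URL` parses the `user:password@` userinfo portion of the URL, but `HttpURLConnection` never turns it into an `Authorization` header on its own, so `doFetchFile` opens the connection without credentials. A minimal sketch:

```java
import java.net.URL;

public class UserInfoDemo {
    public static void main(String[] args) throws Exception {
        // java.net.URL parses the userinfo portion of the URL just fine...
        URL url = new URL("https://username:secret@host.com/my_app/pipeline.jar");
        System.out.println(url.getUserInfo());  // username:secret

        // ...but HttpURLConnection does not translate it into an
        // "Authorization: Basic ..." request header, so the server
        // answers 401 unless the caller sets the header (or installs
        // a java.net.Authenticator) itself.
    }
}
```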
      

       

      When downloading the file manually with wget, I first receive a 401 error, but wget then retries with credentials:

      
      HTTP request sent, awaiting response... 401 Unauthorized
      Authentication selected: Basic realm="Restricted"
      Reusing existing connection to host.com:443.
      HTTP request sent, awaiting response... 200 OK
      
      

      When I use `--auth-no-challenge` with wget, the credentials are sent directly in the first request and I receive 200 OK. Without that flag, wget first requests the file without credentials, receives the 401 challenge, and only then retries with credentials, so the download takes two steps. My issue looks similar, except that spark-submit never sends the credentials at all: neither in the first request nor in a retry after the 401 challenge.
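      A hedged sketch of what a fix or workaround could look like: build the `Authorization: Basic ...` header from the URL's userinfo and send it preemptively, the way wget does with `--auth-no-challenge`. The class and method names here (`PreemptiveBasicAuth`, `basicAuthHeader`, `open`) are illustrative, not part of any Spark API:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PreemptiveBasicAuth {

    /** Builds an "Authorization: Basic ..." value from a URL's user:password part. */
    static String basicAuthHeader(String userInfo) {
        String encoded = Base64.getEncoder()
                .encodeToString(userInfo.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    /** Opens a connection, sending credentials preemptively if the URL has userinfo. */
    static HttpURLConnection open(String spec) throws Exception {
        URL url = new URL(spec);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Send the credentials up front, like wget --auth-no-challenge,
        // instead of waiting for a 401 challenge that is never answered.
        if (url.getUserInfo() != null) {
            conn.setRequestProperty("Authorization", basicAuthHeader(url.getUserInfo()));
        }
        return conn;
    }

    public static void main(String[] args) {
        System.out.println(basicAuthHeader("user:password"));
        // Basic dXNlcjpwYXNzd29yZA==
    }
}
```

      Something along these lines inside `Utils.doFetchFile` would make the documented `https://user:password@host/...` form actually work.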


          People

            Assignee: Unassigned
            Reporter: pprzetacznik (Piotr Przetacznik)
            Votes: 0
            Watchers: 2