Spark / SPARK-33199

Mesos task fails when --py-files and the docker image option are used together


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Mesos
    • Labels: None

    Description

      This is a bug in the shell escaping performed by MesosClusterScheduler. As described in SPARK-23941, the scheduler escapes all option values, which also escapes the $MESOS_SANDBOX variable used in --py-files paths when the spark.mesos.executor.docker.image option is set. The variable is therefore never expanded inside the container, which in turn causes a FileNotFoundException when SparkSession.getOrCreate creates the SparkContext through the py4j gateway.
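
      For context, the escaping introduced by SPARK-23941 roughly wraps values containing shell special characters in double quotes and backslash-escapes characters such as $. The following is a minimal, simplified sketch of that behavior (not Spark's exact code); it illustrates why a --py-files value referencing $MESOS_SANDBOX reaches the driver unexpanded:

      // Simplified sketch of the escaping behavior (not Spark's exact code).
      object EscapeDemo {
        // Wrap the value in double quotes and backslash-escape shell
        // metacharacters, including '$', so the shell treats them literally.
        def shellEscape(value: String): String =
          "\"" + value.replaceAll("""(["`\$\\])""", """\\$1""") + "\""

        def main(args: Array[String]): Unit = {
          // Prints "\$MESOS_SANDBOX/PySparkTestInclude.py": with the '$'
          // escaped, the container shell never expands MESOS_SANDBOX, and
          // Spark later looks for a file at that literal path.
          println(shellEscape("$MESOS_SANDBOX/PySparkTestInclude.py"))
        }
      }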

      To reproduce the bug, you need Mesos and the Mesos dispatcher running. Submit the following job:

      spark-submit --master mesos://127.0.1.1:7077 --deploy-mode cluster --class my_python_program.py --py-files /path/to/PySparkTestInclude.py --name my_python_program --conf spark.mesos.executor.docker.image=myrepo/myimage:mytag /path/to/my_python_program.py 100
      

      The task then fails with the following output:

      I1020 16:26:11.958860 17736 fetcher.cpp:562] Fetcher Info: {"cache_directory":"/tmp/mesos/fetch/nobody","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"/path/to/my_python_program.py"}},{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"/path/to/PySparkTestInclude.py"}}],"sandbox_directory":"/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375","stall_timeout":{"nanoseconds":60000000000},"user":"nobody"}
      I1020 16:26:11.962540 17736 fetcher.cpp:459] Fetching URI '/path/to/my_python_program.py'
      I1020 16:26:11.962558 17736 fetcher.cpp:290] Fetching '/path/to/my_python_program.py' directly into the sandbox directory
      I1020 16:26:11.962584 17736 fetcher.cpp:175] Downloading resource from '/path/to/my_python_program.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/my_python_program.py'
      W1020 16:26:12.126960 17736 fetcher.cpp:332] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: /path/to/my_python_program.py
      I1020 16:26:12.126986 17736 fetcher.cpp:618] Fetched '/path/to/my_python_program.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/my_python_program.py'
      I1020 16:26:12.126991 17736 fetcher.cpp:459] Fetching URI '/path/to/PySparkTestInclude.py'
      I1020 16:26:12.126996 17736 fetcher.cpp:290] Fetching '/path/to/PySparkTestInclude.py' directly into the sandbox directory
      I1020 16:26:12.127013 17736 fetcher.cpp:175] Downloading resource from '/path/to/PySparkTestInclude.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/PySparkTestInclude.py'
      W1020 16:26:12.283085 17736 fetcher.cpp:332] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: /path/to/PySparkTestInclude.py
      I1020 16:26:12.283108 17736 fetcher.cpp:618] Fetched '/path/to/PySparkTestInclude.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/PySparkTestInclude.py'
      I1020 16:26:12.283114 17736 fetcher.cpp:623] Successfully fetched all URIs into '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375'
      Marked '/' as rslave
      W1020 16:26:12.389969     2 openssl.cpp:553] Failed SSL connections will be downgraded to a non-SSL socket
      W1020 16:26:12.390067     2 openssl.cpp:575] TLS certificate verification was enabled by setting one of LIBPROCESS_SSL_VERIFY_CERT or LIBPROCESS_SSL_REQUIRE_CERT, but can be bypassed because TLS downgrades are enabled.
      I1020 16:26:12.390079     2 openssl.cpp:599] CA directory path unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=<dirpath>
      I1020 16:26:12.390087     2 openssl.cpp:623] Will use IP address verification in subject alternative name certificate extension.
      I1020 16:26:12.390095     2 openssl.cpp:642] Will verify server certificates for outgoing TLS connections.
      I1020 16:26:12.390105     2 openssl.cpp:663] Using 'openssl' scheme for hostname validation
      I1020 16:26:12.390331     2 openssl.cpp:688] Using CA file: /var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/.ssl/ca-bundle.crt
      W1020 16:26:12.390651     2 process.cpp:1055] Failed SSL connections will be downgraded to a non-SSL socket
      I1020 16:26:12.392330     2 executor.cpp:210] Version: 1.10.1
      I1020 16:26:12.392472     2 executor.cpp:286] Using URL http+unix://%2Fvar%2Frun%2Fmesos%2Fmesos-executors.sock/slave(1)/api/v1/executor for the executor API endpoint
      I1020 16:26:12.398231     8 executor.cpp:190] Received SUBSCRIBED event
      I1020 16:26:12.398533     8 executor.cpp:194] Subscribed executor on 172.16.72.91
      I1020 16:26:12.398721     8 executor.cpp:190] Received LAUNCH event
      I1020 16:26:12.399922     8 executor.cpp:671] Overwriting environment variable 'LIBPROCESS_SSL_CA_FILE'
      I1020 16:26:12.399947     8 executor.cpp:671] Overwriting environment variable 'LIBPROCESS_SSL_CERT_FILE'
      I1020 16:26:12.399964     8 executor.cpp:671] Overwriting environment variable 'MESOS_NATIVE_JAVA_LIBRARY'
      I1020 16:26:12.399986     8 executor.cpp:671] Overwriting environment variable 'LIBPROCESS_SSL_KEY_FILE'
      I1020 16:26:12.400000     8 executor.cpp:671] Overwriting environment variable 'PATH'
      I1020 16:26:12.400053     8 executor.cpp:722] Starting task driver-20201020162611-0012
      I1020 16:26:12.401970     8 executor.cpp:740] Forked command at 12
      I1020 16:26:12.403370     9 executor.cpp:190] Received ACKNOWLEDGED event
      I1020 16:26:12.404062     9 executor.cpp:190] Received ACKNOWLEDGED event
      Preparing rootfs at '/var/lib/mesos/slave/provisioner/containers/f98449ad-ecba-4bc9-a979-99f21f227375/backends/overlay/rootfses/f80c3e69-73e8-4154-a8e5-8c68613f6462'
      Marked '/' as rslave
      Changing root to /var/lib/mesos/slave/provisioner/containers/f98449ad-ecba-4bc9-a979-99f21f227375/backends/overlay/rootfses/f80c3e69-73e8-4154-a8e5-8c68613f6462
      spark-env: Configured SPARK_LOCAL_IP with bootstrap: 172.16.72.91
      spark-env: User: nobody
      spark-env: No SPARK_MESOS_KRB5_CONF_BASE64 to decode
      spark-env: Configuring StatsD endpoint 198.51.100.1:46584 in /opt/spark/conf/metrics.properties
      20/10/20 16:26:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      20/10/20 16:26:17 INFO SparkContext: Running Spark version 3.0.1
      20/10/20 16:26:17 INFO ResourceUtils: ==============================================================
      20/10/20 16:26:17 INFO ResourceUtils: Resources for spark.driver:
      20/10/20 16:26:17 INFO ResourceUtils: ==============================================================
      20/10/20 16:26:17 INFO SparkContext: Submitted application: PythonPi
      20/10/20 16:26:17 INFO SecurityManager: Changing view acls to: nobody
      20/10/20 16:26:17 INFO SecurityManager: Changing modify acls to: nobody
      20/10/20 16:26:17 INFO SecurityManager: Changing view acls groups to: 
      20/10/20 16:26:17 INFO SecurityManager: Changing modify acls groups to: 
      20/10/20 16:26:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(nobody); groups with view permissions: Set(); users  with modify permissions: Set(nobody); groups with modify permissions: Set()
      20/10/20 16:26:17 INFO Utils: Successfully started service 'sparkDriver' on port 37759.
      20/10/20 16:26:17 INFO SparkEnv: Registering MapOutputTracker
      20/10/20 16:26:18 INFO SparkEnv: Registering BlockManagerMaster
      20/10/20 16:26:18 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
      20/10/20 16:26:18 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
      20/10/20 16:26:18 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
      20/10/20 16:26:18 INFO DiskBlockManager: Created local directory at /mnt/mesos/sandbox/blockmgr-2be3fdd6-db33-49e8-bc0a-5b1da5ec8ded
      20/10/20 16:26:18 INFO MemoryStore: MemoryStore started with capacity 912.3 MiB
      20/10/20 16:26:18 INFO MetricsConfig: Loading metrics properties from resource metrics.properties
      20/10/20 16:26:18 INFO MetricsConfig: Metrics properties: {*.sink.statsd.class=org.apache.spark.metrics.sink.statsd.StatsdSink, *.sink.statsd.host=198.51.100.1, *.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet, applications.sink.servlet.path=/metrics/applications/json, *.sink.statsd.tags=, *.source.jvm.class=org.apache.spark.metrics.source.JvmSource, *.sink.statsd.port=46584, master.sink.servlet.path=/metrics/master/json, *.sink.statsd.prefix=, *.sink.servlet.path=/metrics/json}
      20/10/20 16:26:18 INFO SparkEnv: Registering OutputCommitCoordinator
      20/10/20 16:26:18 INFO log: Logging initialized @5603ms to org.sparkproject.jetty.util.log.Slf4jLog
      20/10/20 16:26:18 INFO Server: jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_212-b03
      20/10/20 16:26:18 INFO Server: Started @5872ms
      20/10/20 16:26:18 INFO AbstractConnector: Started ServerConnector@1345a271{HTTP/1.1,[http/1.1]}{172.16.72.91:4040}
      20/10/20 16:26:18 INFO Utils: Successfully started service 'SparkUI' on port 4040.
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@37925df7{/jobs,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@31660e32{/jobs/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@68c3022c{/jobs/job,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9662252{/jobs/job/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@75e43d99{/stages,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@74c915ac{/stages/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7d15bd53{/stages/stage,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@10d10bcf{/stages/stage/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6ce43fb0{/stages/pool,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42f60f93{/stages/pool/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@228a2392{/storage,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@50576ab{/storage/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6de665b4{/storage/rdd,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fb35a9d{/storage/rdd/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5bb2641{/environment,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5f7aae9d{/environment/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@29b1337f{/executors,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7af3cf55{/executors/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@b196f0b{/executors/threadDump,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2383aabe{/executors/threadDump/json,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@545844d3{/static,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1adf4e59{/,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2ae8fb6c{/api,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@21e2f70e{/jobs/job/kill,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6199fb32{/stages/stage/kill,null,AVAILABLE,@Spark}
      20/10/20 16:26:18 INFO SparkUI: Bound SparkUI to 172.16.72.91, and started at http://172.16.72.91:4040
      20/10/20 16:26:18 ERROR SparkContext: Error initializing SparkContext.
      java.io.FileNotFoundException: File file:/mnt/mesos/sandbox/$MESOS_SANDBOX/PySparkTestInclude.py does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:1534)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:1498)
        at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:494)
        at org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:494)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:494)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
      20/10/20 16:26:18 INFO AbstractConnector: Stopped Spark@1345a271{HTTP/1.1,[http/1.1]}{172.16.72.91:4040}
      20/10/20 16:26:18 INFO SparkUI: Stopped Spark web UI at http://172.16.72.91:4040
      20/10/20 16:26:19 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
      20/10/20 16:26:19 INFO MemoryStore: MemoryStore cleared
      20/10/20 16:26:19 INFO BlockManager: BlockManager stopped
      20/10/20 16:26:19 INFO BlockManagerMaster: BlockManagerMaster stopped
      20/10/20 16:26:19 WARN MetricsSystem: Stopping a MetricsSystem that is not running
      20/10/20 16:26:19 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
      20/10/20 16:26:19 INFO SparkContext: Successfully stopped SparkContext
      20/10/20 16:26:19 INFO ShutdownHookManager: Shutdown hook called
      20/10/20 16:26:19 INFO ShutdownHookManager: Shutdown hook called
      20/10/20 16:26:19 INFO ShutdownHookManager: Deleting directory /tmp/spark-07db5724-6031-45d3-8222-c06947f80872
      20/10/20 16:26:19 INFO ShutdownHookManager: Deleting directory /mnt/mesos/sandbox/spark-d4145378-f0da-4f1b-8aee-a72d427ccb8f
      I1020 16:26:19.363329     3 executor.cpp:1041] Command exited with status 1 (pid: 12)
      I1020 16:26:19.366904     4 executor.cpp:190] Received ACKNOWLEDGED event
      I1020 16:26:19.367244     2 process.cpp:935] Stopped the socket accept loop
      
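      A possible mitigation (a sketch only; the committed patch may differ) is to skip escaping for values that reference Mesos-provided variables such as $MESOS_SANDBOX, leaving them for the container shell to expand:

      // Sketch of one possible mitigation (not necessarily the actual fix):
      // leave values that reference the Mesos sandbox variable unescaped so
      // the container shell can still expand them at launch time.
      def maybeShellEscape(value: String): String =
        if (value.contains("$MESOS_SANDBOX")) value
        else "\"" + value.replaceAll("""(["`\$\\])""", """\\$1""") + "\""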

          People

             Assignee: Unassigned
             Reporter: Farhan Khan (farhan5900)
