Details
- Type: Bug
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.0.1
- Fix Version/s: None
- Component/s: None
Description
This is a bug related to shell escaping in MesosClusterScheduler. As described in SPARK-23941, all submission options are shell-escaped, which also escapes the $MESOS_SANDBOX placeholder used in --py-files paths when the spark.mesos.executor.docker.image option is set. Because the escaped placeholder is never expanded by the shell, the driver looks for a file at the literal path containing $MESOS_SANDBOX, which causes a FileNotFoundException during SparkSession.getOrCreate (via py4j).
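The mechanism can be illustrated outside of Spark. Spark's actual escaping is done in Scala, but the effect is the same as Python's shlex.quote: single-quoting a value that contains $MESOS_SANDBOX prevents the shell inside the container from expanding it. This is only a minimal sketch of the quoting behavior, not Spark's code path; the /mnt/mesos/sandbox value below is an assumed example.

```python
import os
import shlex
import subprocess

# The placeholder Spark puts into --py-files paths when a Docker image is
# used; it is meant to be expanded by the shell inside the container.
path = "$MESOS_SANDBOX/PySparkTestInclude.py"
env = dict(os.environ, MESOS_SANDBOX="/mnt/mesos/sandbox")

# Unescaped, the shell expands the variable as intended.
expanded = subprocess.run(
    ["sh", "-c", "echo " + path],
    capture_output=True, text=True, env=env,
).stdout.strip()
print(expanded)  # /mnt/mesos/sandbox/PySparkTestInclude.py

# Shell-escaping the value (as MesosClusterScheduler does for every option
# since SPARK-23941) single-quotes it and suppresses the expansion, so the
# literal string "$MESOS_SANDBOX/..." reaches the driver as a file path.
escaped = shlex.quote(path)
literal = subprocess.run(
    ["sh", "-c", "echo " + escaped],
    capture_output=True, text=True, env=env,
).stdout.strip()
print(literal)  # $MESOS_SANDBOX/PySparkTestInclude.py
```

The unexpanded placeholder then gets resolved relative to the working directory, which is exactly the `file:/mnt/mesos/sandbox/$MESOS_SANDBOX/PySparkTestInclude.py` path seen in the stack trace below.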
To reproduce the bug, you need Mesos and the Mesos dispatcher running. Launch the following program:
spark-submit \
  --master mesos://127.0.1.1:7077 \
  --deploy-mode cluster \
  --class my_python_program.py \
  --py-files /path/to/PySparkTestInclude.py \
  --name my_python_program \
  --conf spark.mesos.executor.docker.image=myrepo/myimage:mytag \
  /path/to/my_python_program.py 100
The task then fails with the following output:
I1020 16:26:11.958860 17736 fetcher.cpp:562] Fetcher Info: {"cache_directory":"/tmp/mesos/fetch/nobody","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"/path/to/my_python_program.py"}},{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"/path/to/PySparkTestInclude.py"}}],"sandbox_directory":"/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375","stall_timeout":{"nanoseconds":60000000000},"user":"nobody"}
I1020 16:26:11.962540 17736 fetcher.cpp:459] Fetching URI '/path/to/my_python_program.py'
I1020 16:26:11.962558 17736 fetcher.cpp:290] Fetching '/path/to/my_python_program.py' directly into the sandbox directory
I1020 16:26:11.962584 17736 fetcher.cpp:175] Downloading resource from '/path/to/my_python_program.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/my_python_program.py'
W1020 16:26:12.126960 17736 fetcher.cpp:332] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: /path/to/my_python_program.py
I1020 16:26:12.126986 17736 fetcher.cpp:618] Fetched '/path/to/my_python_program.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/my_python_program.py'
I1020 16:26:12.126991 17736 fetcher.cpp:459] Fetching URI '/path/to/PySparkTestInclude.py'
I1020 16:26:12.126996 17736 fetcher.cpp:290] Fetching '/path/to/PySparkTestInclude.py' directly into the sandbox directory
I1020 16:26:12.127013 17736 fetcher.cpp:175] Downloading resource from '/path/to/PySparkTestInclude.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/PySparkTestInclude.py'
W1020 16:26:12.283085 17736 fetcher.cpp:332] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: /path/to/PySparkTestInclude.py
I1020 16:26:12.283108 17736 fetcher.cpp:618] Fetched '/path/to/PySparkTestInclude.py' to '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/PySparkTestInclude.py'
I1020 16:26:12.283114 17736 fetcher.cpp:623] Successfully fetched all URIs into '/var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375'
Marked '/' as rslave
W1020 16:26:12.389969 2 openssl.cpp:553] Failed SSL connections will be downgraded to a non-SSL socket
W1020 16:26:12.390067 2 openssl.cpp:575] TLS certificate verification was enabled by setting one of LIBPROCESS_SSL_VERIFY_CERT or LIBPROCESS_SSL_REQUIRE_CERT, but can be bypassed because TLS downgrades are enabled.
I1020 16:26:12.390079 2 openssl.cpp:599] CA directory path unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=<dirpath>
I1020 16:26:12.390087 2 openssl.cpp:623] Will use IP address verification in subject alternative name certificate extension.
I1020 16:26:12.390095 2 openssl.cpp:642] Will verify server certificates for outgoing TLS connections.
I1020 16:26:12.390105 2 openssl.cpp:663] Using 'openssl' scheme for hostname validation
I1020 16:26:12.390331 2 openssl.cpp:688] Using CA file: /var/lib/mesos/slave/slaves/746d6707-8689-4698-9b93-7368ad523421-S6/frameworks/746d6707-8689-4698-9b93-7368ad523421-0016/executors/driver-20201020162611-0012/runs/f98449ad-ecba-4bc9-a979-99f21f227375/.ssl/ca-bundle.crt
W1020 16:26:12.390651 2 process.cpp:1055] Failed SSL connections will be downgraded to a non-SSL socket
I1020 16:26:12.392330 2 executor.cpp:210] Version: 1.10.1
I1020 16:26:12.392472 2 executor.cpp:286] Using URL http+unix://%2Fvar%2Frun%2Fmesos%2Fmesos-executors.sock/slave(1)/api/v1/executor for the executor API endpoint
I1020 16:26:12.398231 8 executor.cpp:190] Received SUBSCRIBED event
I1020 16:26:12.398533 8 executor.cpp:194] Subscribed executor on 172.16.72.91
I1020 16:26:12.398721 8 executor.cpp:190] Received LAUNCH event
I1020 16:26:12.399922 8 executor.cpp:671] Overwriting environment variable 'LIBPROCESS_SSL_CA_FILE'
I1020 16:26:12.399947 8 executor.cpp:671] Overwriting environment variable 'LIBPROCESS_SSL_CERT_FILE'
I1020 16:26:12.399964 8 executor.cpp:671] Overwriting environment variable 'MESOS_NATIVE_JAVA_LIBRARY'
I1020 16:26:12.399986 8 executor.cpp:671] Overwriting environment variable 'LIBPROCESS_SSL_KEY_FILE'
I1020 16:26:12.400000 8 executor.cpp:671] Overwriting environment variable 'PATH'
I1020 16:26:12.400053 8 executor.cpp:722] Starting task driver-20201020162611-0012
I1020 16:26:12.401970 8 executor.cpp:740] Forked command at 12
I1020 16:26:12.403370 9 executor.cpp:190] Received ACKNOWLEDGED event
I1020 16:26:12.404062 9 executor.cpp:190] Received ACKNOWLEDGED event
Preparing rootfs at '/var/lib/mesos/slave/provisioner/containers/f98449ad-ecba-4bc9-a979-99f21f227375/backends/overlay/rootfses/f80c3e69-73e8-4154-a8e5-8c68613f6462'
Marked '/' as rslave
Changing root to /var/lib/mesos/slave/provisioner/containers/f98449ad-ecba-4bc9-a979-99f21f227375/backends/overlay/rootfses/f80c3e69-73e8-4154-a8e5-8c68613f6462
spark-env: Configured SPARK_LOCAL_IP with bootstrap: 172.16.72.91
spark-env: User: nobody
spark-env: No SPARK_MESOS_KRB5_CONF_BASE64 to decode
spark-env: Configuring StatsD endpoint 198.51.100.1:46584 in /opt/spark/conf/metrics.properties
20/10/20 16:26:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/10/20 16:26:17 INFO SparkContext: Running Spark version 3.0.1
20/10/20 16:26:17 INFO ResourceUtils: ==============================================================
20/10/20 16:26:17 INFO ResourceUtils: Resources for spark.driver:
20/10/20 16:26:17 INFO ResourceUtils: ==============================================================
20/10/20 16:26:17 INFO SparkContext: Submitted application: PythonPi
20/10/20 16:26:17 INFO SecurityManager: Changing view acls to: nobody
20/10/20 16:26:17 INFO SecurityManager: Changing modify acls to: nobody
20/10/20 16:26:17 INFO SecurityManager: Changing view acls groups to:
20/10/20 16:26:17 INFO SecurityManager: Changing modify acls groups to:
20/10/20 16:26:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nobody); groups with view permissions: Set(); users with modify permissions: Set(nobody); groups with modify permissions: Set()
20/10/20 16:26:17 INFO Utils: Successfully started service 'sparkDriver' on port 37759.
20/10/20 16:26:17 INFO SparkEnv: Registering MapOutputTracker
20/10/20 16:26:18 INFO SparkEnv: Registering BlockManagerMaster
20/10/20 16:26:18 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/10/20 16:26:18 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/10/20 16:26:18 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/10/20 16:26:18 INFO DiskBlockManager: Created local directory at /mnt/mesos/sandbox/blockmgr-2be3fdd6-db33-49e8-bc0a-5b1da5ec8ded
20/10/20 16:26:18 INFO MemoryStore: MemoryStore started with capacity 912.3 MiB
20/10/20 16:26:18 INFO MetricsConfig: Loading metrics properties from resource metrics.properties
20/10/20 16:26:18 INFO MetricsConfig: Metrics properties: {*.sink.statsd.class=org.apache.spark.metrics.sink.statsd.StatsdSink, *.sink.statsd.host=198.51.100.1, *.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet, applications.sink.servlet.path=/metrics/applications/json, *.sink.statsd.tags=, *.source.jvm.class=org.apache.spark.metrics.source.JvmSource, *.sink.statsd.port=46584, master.sink.servlet.path=/metrics/master/json, *.sink.statsd.prefix=, *.sink.servlet.path=/metrics/json}
20/10/20 16:26:18 INFO SparkEnv: Registering OutputCommitCoordinator
20/10/20 16:26:18 INFO log: Logging initialized @5603ms to org.sparkproject.jetty.util.log.Slf4jLog
20/10/20 16:26:18 INFO Server: jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_212-b03
20/10/20 16:26:18 INFO Server: Started @5872ms
20/10/20 16:26:18 INFO AbstractConnector: Started ServerConnector@1345a271{HTTP/1.1,[http/1.1]}{172.16.72.91:4040}
20/10/20 16:26:18 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@37925df7{/jobs,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@31660e32{/jobs/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@68c3022c{/jobs/job,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9662252{/jobs/job/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@75e43d99{/stages,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@74c915ac{/stages/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7d15bd53{/stages/stage,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@10d10bcf{/stages/stage/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6ce43fb0{/stages/pool,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42f60f93{/stages/pool/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@228a2392{/storage,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@50576ab{/storage/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6de665b4{/storage/rdd,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fb35a9d{/storage/rdd/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5bb2641{/environment,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5f7aae9d{/environment/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@29b1337f{/executors,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7af3cf55{/executors/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@b196f0b{/executors/threadDump,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2383aabe{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@545844d3{/static,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1adf4e59{/,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2ae8fb6c{/api,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@21e2f70e{/jobs/job/kill,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6199fb32{/stages/stage/kill,null,AVAILABLE,@Spark}
20/10/20 16:26:18 INFO SparkUI: Bound SparkUI to 172.16.72.91, and started at http://172.16.72.91:4040
20/10/20 16:26:18 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/mnt/mesos/sandbox/$MESOS_SANDBOX/PySparkTestInclude.py does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
	at org.apache.spark.SparkContext.addFile(SparkContext.scala:1534)
	at org.apache.spark.SparkContext.addFile(SparkContext.scala:1498)
	at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:494)
	at org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:494)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:494)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
20/10/20 16:26:18 INFO AbstractConnector: Stopped Spark@1345a271{HTTP/1.1,[http/1.1]}{172.16.72.91:4040}
20/10/20 16:26:18 INFO SparkUI: Stopped Spark web UI at http://172.16.72.91:4040
20/10/20 16:26:19 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/10/20 16:26:19 INFO MemoryStore: MemoryStore cleared
20/10/20 16:26:19 INFO BlockManager: BlockManager stopped
20/10/20 16:26:19 INFO BlockManagerMaster: BlockManagerMaster stopped
20/10/20 16:26:19 WARN MetricsSystem: Stopping a MetricsSystem that is not running
20/10/20 16:26:19 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/10/20 16:26:19 INFO SparkContext: Successfully stopped SparkContext
20/10/20 16:26:19 INFO ShutdownHookManager: Shutdown hook called
20/10/20 16:26:19 INFO ShutdownHookManager: Shutdown hook called
20/10/20 16:26:19 INFO ShutdownHookManager: Deleting directory /tmp/spark-07db5724-6031-45d3-8222-c06947f80872
20/10/20 16:26:19 INFO ShutdownHookManager: Deleting directory /mnt/mesos/sandbox/spark-d4145378-f0da-4f1b-8aee-a72d427ccb8f
I1020 16:26:19.363329 3 executor.cpp:1041] Command exited with status 1 (pid: 12)
I1020 16:26:19.366904 4 executor.cpp:190] Received ACKNOWLEDGED event
I1020 16:26:19.367244 2 process.cpp:935] Stopped the socket accept loop