Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.4.1
- Fix Version/s: None
Description
An API to customize Python and R workers would allow extensibility beyond what can be expressed via static configs and environment variables such as spark.pyspark.python.
One use case is overriding PATH when shipping an environment via spark.archives with, say, conda-pack (as documented here). Some packages rely on bundled binaries, and to use those packages in Spark, their binaries must be on the PATH.
But PATH cannot be set via a static config because 1) the environment and its binaries may be at a dynamic location (archives are unpacked on the driver into a directory with a random name), and 2) we may not want to clobber the PATH that is preconfigured on the hosts.
Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream.
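A minimal sketch of what such a hook might look like. The callback name, its signature, and the unpack location are hypothetical, not part of any released Spark API; the point is only to illustrate prepending the dynamically resolved bin directory to PATH without clobbering the host's preconfigured value, and choosing the executable at launch time.

```python
import os
from typing import Dict, List, Tuple

def customize_worker(env: Dict[str, str],
                     command: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Hypothetical worker-customization hook: Spark would call this with the
    environment and command it is about to use to launch a Python worker,
    and use the returned pair instead. Neither the name nor the signature
    exists in Spark today."""
    env = dict(env)  # don't mutate the caller's mapping
    # The archive from spark.archives is unpacked into a directory whose name
    # is only known at runtime, so resolve it here rather than in a config.
    # "environment" is an assumed unpack alias for illustration.
    conda_env = os.path.join(os.getcwd(), "environment")
    bin_dir = os.path.join(conda_env, "bin")
    # Prepend rather than replace, preserving the host's preconfigured PATH.
    env["PATH"] = bin_dir + os.pathsep + env.get("PATH", "")
    # The executable can also be selected dynamically (e.g., by version).
    command = [os.path.join(bin_dir, "python3")] + command[1:]
    return env, command
```

Because the hook runs at worker-launch time, it sees the actual unpacked location and the host's real environment, which no static config can anticipate.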