Thanks to a lot of help from Benjamin Zaitlen and his blog post on this problem, I was able to develop a solution that works for Spark on YARN:
# Both these directories exist on all of our YARN nodes.
# Otherwise, everything else is built and shipped out at submit-time
# with our application.
python3 -m venv venv/
# Activate the venv so the pip installs below go into it,
# not into the system Python.
source venv/bin/activate
pip install -U pip
pip install -r requirements.pip
pip install -r requirements-dev.pip
# This convoluted zip machinery ensures that the paths to the files inside
# each zip look the same to Python when it runs within YARN: each archive
# is zipped from *inside* its own directory, so its contents sit at the
# top level of the zip with no leading directory component.
# If there is a simpler way to express this, I'd be interested to know!
(cd venv/ && zip -rq ../venv.zip *)
(cd myproject/ && zip -rq ../myproject.zip *)
(cd tests/ && zip -rq ../tests.zip *)
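To see why zipping from inside the directory matters, here is a small stand-alone sketch (mine, not from the original post) using only the standard library. It builds a toy venv/bin/python tree and zips it both ways, showing how the member paths differ. YARN extracts venv.zip under the alias "venv", so PYSPARK_PYTHON=venv/bin/python only resolves if the interpreter sits at the top level of the archive:

```python
import os
import tempfile
import zipfile

def build_archive(from_inside):
    """Create a toy venv/bin/python tree, zip it, and return the zip's
    member names. from_inside=True mimics `cd venv && zip -rq ../venv.zip *`;
    False mimics zipping the venv/ directory from its parent."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "venv", "bin"))
        target = os.path.join(root, "venv", "bin", "python")
        open(target, "w").close()
        archive = os.path.join(root, "venv.zip")
        arcname = "bin/python" if from_inside else "venv/bin/python"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.write(target, arcname=arcname)
        with zipfile.ZipFile(archive) as zf:
            return zf.namelist()

# Zipped from inside: the interpreter is at the top level, so after YARN
# extracts the archive under the alias "venv", venv/bin/python resolves.
print(build_archive(from_inside=True))   # ['bin/python']

# Zipped from the parent: everything is nested one level too deep, and
# the extracted path would be venv/venv/bin/python -- not what
# PYSPARK_PYTHON points at.
print(build_archive(from_inside=False))  # ['venv/bin/python']
```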
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \
  --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
  --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \
  ...
My solution is based on Ben's, except that where Ben uses Conda I use pip. I don't know whether this solution can be adapted to work with Spark on Mesos or Spark Standalone (I haven't tried, since my environment is YARN), but if someone figures it out, please post your solution here!
As Ben explains in his blog post, this lets you build and ship an isolated environment with your PySpark application out to the YARN cluster. The YARN nodes don't even need to have the correct version of Python installed (or, with a Conda environment like Ben's, any Python at all), because you are shipping out a complete environment via the --archives option. One caveat with my pip/venv variant: a venv borrows the node's base Python installation rather than bundling the interpreter the way Conda does, so the nodes do still need a compatible Python available.
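As a sanity check that the job really runs inside the shipped environment, you can ask each executor which interpreter it is using. This is my own sketch, not from Ben's post; the function itself is plain Python, so it runs anywhere, and the commented line shows how you might apply it on a cluster:

```python
import sys

def interpreter_info(_):
    # Runs on an executor; reports which Python binary is executing
    # and its version.
    return sys.executable, tuple(sys.version_info[:3])

# On the cluster (given a SparkContext `sc`) you would run something like:
#   sc.parallelize(range(4), 4).map(interpreter_info).distinct().collect()
# and expect interpreter paths under the extracted archive, i.e. ending
# in venv/bin/python, rather than a system path like /usr/bin/python3.

# Locally, the same function simply reports the driver's interpreter:
print(interpreter_info(None))
```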
I hope this helps some people who are looking for a workaround they can use today while a more robust solution is developed directly into Spark.
And I wonder... if this --archives technique can be extended or translated to Mesos and Standalone somehow, maybe that would be a good enough solution for the time being? People would be able to run their jobs in an isolated Python environment using their tool of choice (conda or pip), and Spark wouldn't need to add any virtualenv-specific machinery.