Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

      Description

      Currently, it's not easy for users to add third-party Python packages in PySpark.

      • One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies)
      • Another way is to install packages manually on each node (time-consuming, and not easy when switching between environments)

      Python now has two different virtualenv implementations: one is the native virtualenv, the other is conda. This JIRA aims to bring both tools to the distributed environment.

        Issue Links

          Activity

          gaetan@xeberon.net Semet added a comment - - edited

          Hello. For me, this solution is equivalent to the "Wheelhouse" proposal I made (SPARK-16367), without even having to modify PySpark at all. I even think you can package a wheelhouse using this --archives argument.
          The drawback is indeed that your spark-submit has to send this package to each node (1 to n). If PySpark supported the requirements.txt/Pipfile dependency description formats, each node would download the dependencies by itself...
          The strong argument for wheelhouse is that it only packages the libraries used by the project, not the complete environment. The drawback is that it may not work well with Anaconda.

          nchammas Nicholas Chammas added a comment -

          To follow up on my earlier comment, I created a completely self-contained sample repo demonstrating a technique for bundling PySpark app dependencies in an isolated way. It's the technique that Ben, I, and several others discussed here in this JIRA issue.

          https://github.com/massmutual/sample-pyspark-application

          The approach has advantages (like letting you ship a completely isolated Python environment, so you don't even need Python installed on the workers) and disadvantages (requires YARN; increases job startup time). Hope some of you find the sample repo useful until Spark adds more "first-class" support for Python dependency isolation.

          zjffdu Jeff Zhang added a comment -

          I linked a detailed document about how to use it in both batch mode and interactive mode. If anyone is interested in this, you can try the PR.

          zjffdu Jeff Zhang added a comment -

          If it is a pretty large cluster, then I would suggest setting up a private PyPI repository server.

          tsp Prasanna Santhanam added a comment -

          Jeff Zhang In the case of Anaconda Python, the environment is self-contained. The conda environment is managed using binaries that are hardlinked, unlike in virtualenv, so zipping the conda environment zips the entire Python system together. After that, Spark only needs to know that the Python binary it should use is relative to the archives that were distributed. Hence I was able to run my Spark application without installing Anaconda on any of my workers.

          Let's give these mechanisms names to make them easier to compare. On one hand we have the "Library Distribution Mechanism", and on the other the "Library Download Mechanism". Right now, the overhead is greater for the distribution mechanism, because zipping the binaries eats up significant time before application startup. This was my observation on a 16-core machine, YMMV. The distribution mechanism could, however, be improved with some basic caching on the gateway node, so that subsequent zip operations for a different Spark application with similar library requirements are faster.

          In the library download mechanism, on the other hand, the downloads are very fast and can even be cached locally or proxied to a locally managed egg/PyPI repository/conda channel. So the download mechanism is still superior, save for some complexity in the implementation and the options to be specified. However, I'm not sure whether downloads will be throttled on publicly exposed repositories like PyPI when, say, a 1000-node Spark cluster simultaneously requests Python binaries from all its workers.

          zjffdu Jeff Zhang added a comment -

          Prasanna Santhanam I don't understand how this can work without Python installed on the worker nodes. And regarding the overhead: whether we distribute the dependencies or download them, the overhead cannot be avoided. From my experience, one advantage of the downloading approach is that it caches the dependencies on the worker node, so if the executor runs on a node where the dependencies are already cached, it saves a lot of time setting up the virtualenv.

          tsp Prasanna Santhanam added a comment -

          Nicholas Chammas sorry, this got buried in several other emails at my org.

          What you've implemented as a shell installer is exactly what I've done, except within Spark code branched off the 2.0.1 release. I use the YARN archives mechanism to distribute the zip files and control the conda environment binaries. It should be straightforward to change my diff to work with virtualenv as well. As you've explained, the advantage of this process is that Python doesn't need to be installed at all on the worker nodes. I've also implemented the mechanism with --py-files so standalone Spark can take advantage of it, but I haven't got around to testing that yet.

          The downside of the zip distribution solution, however, is that the start time of the application increased significantly - nearly 4 minutes to zip all the libraries. I tested this on a 16-core machine with 30GB of memory. When I zip the binaries, it ends up creating a 400MB archive for just basic libraries like matplotlib, scipy, and numpy. What zip times did you experience?

          Much to my surprise, contrasted with Jeff Zhang's original proposal of downloading the dependencies on all the workers, this eats up significant time for a Spark program that runs for no more than 2s. This stopped me from pushing the implementation further. I'd like to hear your observations from testing your shell implementation of the same.

          nchammas Nicholas Chammas added a comment -

          Thanks to a lot of help from Benjamin Zaitlen and his blog post on this problem, I was able to develop a solution that works for Spark on YARN:

          set -e
          
          # Both these directories exist on all of our YARN nodes.
          # Otherwise, everything else is built and shipped out at submit-time
          # with our application.
          export HADOOP_CONF_DIR="/etc/hadoop/conf"
          export SPARK_HOME="/hadoop/spark/spark-2.0.2-bin-hadoop2.6"
          
          export PATH="$SPARK_HOME/bin:$PATH"
          
          python3 -m venv venv/
          source venv/bin/activate
          
          pip install -U pip
          pip install -r requirements.pip
          pip install -r requirements-dev.pip
          
          deactivate
          
          # This convoluted zip machinery is to ensure that the paths to the files inside the zip
          # look the same to Python when it runs within YARN.
          # If there is a simpler way to express this, I'd be interested to know!
          pushd venv/
          zip -rq ../venv.zip *
          popd
          pushd myproject/
          zip -rq ../myproject.zip *
          popd
          pushd tests/
          zip -rq ../tests.zip *
          popd
          
          export PYSPARK_PYTHON="venv/bin/python"
          
          spark-submit \
            --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \
            --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
            --master yarn \
            --deploy-mode client \
            --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \
            run_tests.py -v
          

          My solution is based on Ben's, except where Ben uses conda I just use pip. I don't know if there is a way to adapt this solution to work with Spark on Mesos or Spark Standalone (and I haven't tried, since my environment is YARN), but if someone figures it out, please post your solution here!

          As Ben explains in his blog post, this lets you build and ship an isolated environment with your PySpark application out to the YARN cluster. The YARN nodes don't even need to have the correct version of Python (or Python at all!) installed, because you are shipping out a complete Python environment via the --archives option.

          I hope this helps some people who are looking for a workaround they can use today while a more robust solution is developed directly into Spark.

          And I wonder... if this --archives technique can be extended or translated to Mesos and Standalone somehow, maybe that would be a good enough solution for the time being? People would be able to run their jobs in an isolated Python environment using their tool of choice (conda or pip), and Spark wouldn't need to add any virtualenv-specific machinery.

          gaetan@xeberon.net Semet added a comment -

          For myself, I share an NFS folder with all the executors. It works because they all have the same architecture and distribution.

          Frankly, I'm beginning to be a bit disappointed that there is no enthusiasm, no real will to solve this huge hole in PySpark. Dependency management was solved years ago in Python, with virtualenv in general and with Anaconda in data science, but PySpark still continues to play with the PYTHONPATH, and there is no Spark core developer actively involved in helping us integrate such a patch. Dependencies for JARs are now handled by --packages, automatically downloading the files from a remote repository - why not do that for Python as well? And maybe for R too, if available? I even proposed a way to package everything in a single zip archive, called a "wheelhouse", so executors might not have to download anything.

          So please help us raise this concern with the core developers, to tell them that several people are interested in solving this issue.

          nchammas Nicholas Chammas added a comment -

          Prasanna Santhanam:

          Previously, I have had reasonable success with zipping the contents of my conda environment in the gateway/driver node and submitting the zip file as an argument to --archives in the spark-submit command line. This approach works perfectly because it uses the existing spark infrastructure to distribute dependencies through to the workers. You actually don't even need anaconda installed on the workers since the zip can package the entire python installation within it. The downside of it being that conda zip files can bloat up quickly in a production spark application.

          Can you elaborate on how you did this? I'm willing to jump through some hoops to create a hackish way of distributing dependencies while this JIRA task gets worked out.

          What I'm trying is:

          1. Create a virtual environment and activate it.
          2. Pip install my requirements into that environment, as one would in a regular Python project.
          3. Zip up the venv/ folder and ship it with my application using --py-files.

          I'm struggling to get the workers to pick up Python dependencies from the packaged venv over what's in the system site-packages. All I want is to be able to ship out the dependencies with the application from a virtual environment all at once (i.e. without having to enumerate each dependency).

          Has anyone been able to do this today? It would be good to document it as a workaround for people until this issue is resolved.

          tsp Prasanna Santhanam added a comment -

          Thanks for this JIRA and the work related to it - I have been testing this patch a little with conda environments.

          Previously, I have had reasonable success with zipping the contents of my conda environment in the gateway/driver node and submitting the zip file as an argument to --archives in the spark-submit command line. This approach works perfectly because it uses the existing spark infrastructure to distribute dependencies through to the workers. You actually don't even need anaconda installed on the workers since the zip can package the entire python installation within it. The downside of it being that conda zip files can bloat up quickly in a production spark application.

          Jeff Zhang In your approach I find that the driver program still executes on the native Python installation and only the workers run within conda (virtualenv) environments. Would it not be possible to use the same conda environment throughout? I.e., set it up once on the gateway node and propagate it over the distributed cache, as mentioned in a related PR comment.

          I can always force the driver Python to be conda using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON`, but that is not the same conda environment as the one created by your PythonWorkerFactory. Or is it not your intention for it to work this way?
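
          For reference, forcing the driver onto a specific interpreter via those environment variables looks like this (the env path is an assumption for illustration):

```shell
#!/bin/sh
# Point both the driver and the workers at the same interpreter; without
# this, the driver may run on the system Python while workers use another.
export PYSPARK_PYTHON="/opt/conda/envs/myenv/bin/python"
export PYSPARK_DRIVER_PYTHON="$PYSPARK_PYTHON"
# spark-submit app.py
```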

          gaetan@xeberon.net Semet added a comment -
          Full proposal is here: https://issues.apache.org/jira/browse/SPARK-16367
          takaomag Takao Magoori added a comment -

          Sorry - it seems there is no isolated site-packages directory. The workdir is just added to sys.path.

          takaomag Takao Magoori added a comment -

          Why is virtualenv required?
          I feel supporting native virtualenv is unnecessary, since each Spark executor (worker) already has its own isolated site-packages directory (the workdir), just like a native virtualenv. Is it only for conda?

          gaetan@xeberon.net Semet added a comment - - edited

          Yes, it looks cool!
          Here is what I have in mind - tell me if it is the wrong direction:

          • each job should execute in its own environment.
          • I love wheels, and wheelhouse. Provided we build all the needed wheels on the same machine as the cluster, or retrieve the right wheels from PyPI, pip can install all the dependencies with lightning speed, without needing an internet connection (having to configure the proxy at some corporations, or maintain an internal mirror, etc.).
          • so we deploy the job with a command line such as:
          bin/spark-submit --master $(spark_master) --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=native" --conf "spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" --conf "spark.pyspark.virtualenv.script=script_name" --conf "spark.pyspark.virtualenv.args='--opt1 --opt2'"
          

          so:

          • wheelhouse.zip contains all the wheels to install into a fresh virtualenv. No internet connection is needed; the script is also deployed and installed, provided it is packaged as a proper module (easy to do with pbr)
          • spark.pyspark.virtualenv.script is the entry point of the script. It should be declared in the scripts section of the setup.py
          • spark.pyspark.virtualenv.args allows passing extra arguments to the script

          I don't have much experience on YARN or MESOS, what are the big differences?
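
          To illustrate the "declared in the scripts section of the setup.py" part of this proposal, a hypothetical job package might declare its entry point like this (all names here are assumptions, not from the actual PR):

```shell
#!/bin/sh
# Write a minimal setup.py whose console_scripts entry point matches the
# name passed as spark.pyspark.virtualenv.script in the command line above.
cat > setup.py <<'EOF'
from setuptools import setup

setup(
    name="myjob",
    version="0.1",
    py_modules=["myjob"],
    entry_points={"console_scripts": ["script_name = myjob:main"]},
)
EOF
```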

          zjffdu Jeff Zhang added a comment -

          Thanks Semet. Have you taken a look at my PR? Any comments are welcome.

          gaetan@xeberon.net Semet added a comment -

          I back this proposal and am willing to work on it.

          zjffdu Jeff Zhang added a comment - - edited

          Sorry guys, I've been busy on other stuff recently and am late in updating this ticket. I just attached the design doc and created the PR. If you are interested, please help review the design doc and try the PR. Thanks Greg Bowyer Dripple Juliet Hougland Mike Sukmanowsky Dan Blanchard John Berryman

          apachespark Apache Spark added a comment -

          User 'zjffdu' has created a pull request for this issue:
          https://github.com/apache/spark/pull/13599

          apachespark Apache Spark added a comment -

          User 'zjffdu' has created a pull request for this issue:
          https://github.com/apache/spark/pull/13598

          zjffdu Jeff Zhang added a comment -

          Yes, I focused on YARN mode. I did some testing in local mode, but not a full test. I haven't done it on Mesos yet.

          zjffdu Jeff Zhang added a comment -

          Thanks Greg Bowyer In my POC, I implemented it using both conda and native virtualenv/pip. Users can choose whichever they want by specifying an option.

          gbowyer@fastmail.co.uk Greg Bowyer added a comment -

          There is also a small wrapper for this

          http://knit.readthedocs.io/en/latest/

          I guess there is no mesos or local-mode support, but it's a start

          gbowyer@fastmail.co.uk Greg Bowyer added a comment -

          Does this work for people?

          https://www.continuum.io/blog/developer-blog/conda-spark

          The downsides are
          1. It's conda, which is not quite as standard as virtualenv / pip
          2. Conda will install non-free software (e.g. MKL) - this might be a problem

          However I think that it might also help in that conda bundles dependencies like BLAS, LAPACK and friends which might save some folks heartache when dealing with badly configured clusters.

          gbowyer@fastmail.co.uk Greg Bowyer added a comment -

          .... I have been out of this world for a long time.

          The Spex extension was designed to get around this, for a world where NFS was being removed from a cluster. Right now it's an ugly hack, but I might be able to spend some time making it less of a hack.

          If this were integrated into Spark, would we want to make it transparent, or provide a flag to the launcher scripts? I figure a --py-deps=requirements.txt might work?

          JnBrymn John Berryman added a comment -

          At my work we're using devpi and a homecooked proxy server to "fake" pip and serve up cached wheels. I'm not sure such a solution should be built into Spark, but it is at least a POC for making the virtualenv idea fast (after the first build, at least). Really, it wouldn't be hard to set this up outside of Spark; you would just need to make pip actually point to my_own_specially_wheel_caching_pip.
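          As a sketch of the caching setup described above (the devpi host name and index path are illustrative, not part of any Spark feature; `--index-url` is a real pip option):

```shell
# Point pip at a caching index server such as devpi; wheels built once
# are then served from the local cache on subsequent installs.
pip install \
    --index-url http://devpi.example.internal:3141/root/pypi/+simple/ \
    -r requirements.txt
```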

          msukmanowsky Mike Sukmanowsky added a comment -

          That's the (hopefully) beautiful thing about pex.

          $ pex numpy pandas -o c_requirements.pex
          $ ./c_requirements.pex
          Python 2.7.10 (default, Aug  4 2015, 19:54:05)
          [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
          Type "help", "copyright", "credits" or "license" for more information.
          (InteractiveConsole)
          >>> import numpy
          >>> import pandas
          >>> print numpy.__version__
          1.11.0
          >>> print pandas.__version__
          0.18.0
          >>>
          

          The catch of course is that the pex file itself would have to be built on a node with the same arch as the Spark workers (i.e. you can't build a pex on Mac OS and ship it to a Linux cluster unless all dependencies are pure Python). To build a platform-agnostic env, we'd have to look at conda.

          I know there was an effort to support pex with pyspark https://github.com/URXtech/spex but it hasn't seen much activity recently. I tried reaching out to the author but got no response.

          I could take a shot at adding support for this unless @zjffdu already has plans.
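          For illustration, the flow being discussed might look roughly like this. This is a hedged sketch, not a supported feature: pex, spark-submit, and the PYSPARK_PYTHON env var are real, but using a .pex file as the interpreter, and the file names (deps.pex, my_job.py), are hypothetical here.

```shell
# Build the pex on a machine with the same arch/OS as the Spark workers
# (pex files with compiled extensions are not portable across platforms).
pex numpy pandas -o deps.pex

# Ship the pex alongside the job and use it as the workers' Python.
# Treating a .pex as PYSPARK_PYTHON is the idea under discussion,
# not current pyspark behavior.
PYSPARK_PYTHON=./deps.pex spark-submit --files deps.pex my_job.py
```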

          juliet Juliet Hougland added a comment -

          Being able to ship around pex files like we do .py and .egg files sounds very reasonable from a delineation of responsibilities perspective.

          I like the idea and would support a change like that. A question/edge case worth working out is how pex files relate to compiled C libs that Python libs may need to link to. I don't know much about pex, but my initial assessment is that it shouldn't be a huge problem. I like this solution.

          msukmanowsky Mike Sukmanowsky added a comment -

          NFS is a good idea as a current workaround. It's still a pain to build all required eggs from a dependency file like requirements.txt unless pyspark supports something like .pex files, which build a self-contained python executable from said requirements file.

          If pex files were supported by pyspark, the daemon + workers would simply use /path/to/file.pex as the Python executable and get a fully baked virtualenv for free without Spark needing to concern itself with building these files.

          Twitter has reportedly used this to ship distributed Python applications for some time.

          Dripple Dripple added a comment -

          Juliet Hougland can you give details/a pointer on the solution you describe?

          Thanks.

          zjffdu Jeff Zhang added a comment -

          Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs and then mounting those on worker/executor nodes?

          The problem is that most of the time you are not the administrator and don't have permission to do that. It's inefficient to ask your administrator to install the environment for you.

          "one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark."

          Some packages are binary and need to be compiled, and it is not easy to do dependency management this way.

          juliet Juliet Hougland added a comment -

          I really do think spark and pyspark need to stay out of the business of installing anything for people. A generic executable is relatively neutral as to what exactly it does, which is good. Spark's scope should be computation/execution, not environment setup and teardown.

          Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs and then mounting those on worker/executor nodes? This is an elegant solution we (many experienced people at Cloudera like Guru M and Tristan Z recommended this) have seen deployed successfully. I believe given the description of your problem it should suit your needs.

          Marcelo Vanzin has suggested "one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark. The distribution to new containers is handled by YARN, and Spark just would need some adjustments to find the right executable inside those archives."

          msukmanowsky Mike Sukmanowsky added a comment -

          Sorry to bug Juliet Hougland - any thoughts? We're currently trying to think about some kind of workaround for Spark 1.6.0 on Amazon EMR that'd allow us to create a conda/virtual env on YARN nodes prior to the application running, but I don't think there's anything we can really do.

          Building on my suggestion, it'd probably also be helpful to have a --teardown option added to spark-submit that'd allow execution of some script after the Spark application is terminated. This way conda/virtual envs with temporary names could be created/destroyed.

          msukmanowsky Mike Sukmanowsky added a comment -

          Juliet Hougland I get the concerns relating to Spark supporting a complex virtualenv process. My main objection to only supporting something like --pyspark-python is the difficulty we currently face on Amazon EMR, but really any Spark cluster where nodes may be added after an application is submitted faces this issue.

          We have a bootstrap script which provisions our EMR nodes with required Python dependencies. This approach works alright for a cluster which tends to run very few applications, but if we have multiple tenants, this approach quickly gets unwieldy. Ideally, Spark applications could be submitted from a master node with a user never having to worry about dependency management at the node bootstrapping level.

          I was thinking that an interesting approach to this problem would be to provide some sort of a --bootstrap option to spark-submit which points to any executable which Spark will run and check for receipt of a 0 exit code before continuing to launch the application itself. This script could obviously execute any code such as creating a virtualenv or conda env and installing requirements. If a non-zero exit code were received, the Spark application would cease to continue.

          The generalization gets the Spark community away from having to support conda/virtualenv eccentricities. Thoughts?

          zjffdu Jeff Zhang added a comment -

          SPARK-13081 is almost done. But for this ticket, I definitely need your feedback and help; I am not a heavy python user, so I may miss things that python programmers care about.

          juliet Juliet Hougland added a comment -

          That is wonderful. Let me know if you'd like me to help work on it. It has been dangling on my mental todo list for a while.

          zjffdu Jeff Zhang added a comment -

          Actually I created this ticket several weeks ago: SPARK-13081

          juliet Juliet Hougland added a comment -

          I made a comment related to this below. TLDR: I think the suggested --py-env option could be encompassed by an already-needed --pyspark_python sort of option.

          juliet Juliet Hougland added a comment -

          Currently the way users specify the workers' python interpreter is through the PYSPARK_PYTHON env variable. It would be beneficial to users to allow that path to be specified by a CLI flag. That is a current rough edge of using already-installed envs on a cluster.

          If this were added as a CLI flag, I could see valid options being 'pyspark/python/path', 'venv' (temp virtualenv), and 'conda' (temp conda env), with a second flag required to specify the requirements file. I think it helps prevent an explosion of flags for spark-submit while handling a very important and often-changed parameter for a job. What do you think of this?

          zjffdu Jeff Zhang added a comment -

          Thanks for the feedback Juliet Hougland. Yes, that's why I make the virtualenv application-scoped. Creating a python package management tool for a cluster would be a big project and too heavyweight. And what does "--pyspark_python" mean?

          zjffdu Jeff Zhang added a comment -

          Thanks for letting me know this.

          juliet Juliet Hougland added a comment - - edited

          If pyspark allows users to create virtual environments, users will also want and need other features of python environment management on a cluster. I think this change would broaden the scope of PySpark to include python package management on a cluster. I do not think that spark should be in the business of creating python environments. I think the support load in terms of feature requests, mailing list traffic, etc would be very large. This feature would begin to solve a problem, but would also put us on the hook for many more.

          I agree with the general intention of this JIRA – make it easier to manage and interact with complex python environments on a cluster. Perhaps there are other ways to accomplish this without broadening scope and functionality as much. For example, checking a requirements file against an environment before execution.

          Edit: I see now that you are proposing a short-lived virtualenv. My objections about the broadening of scope still apply. I generally do not agree with suggestions that tightly tie us (and users) to a specific method of pyenv management. The loose coupling of python envs on a cluster to pyspark (via a path to an interpreter) is a positive feature. I would much rather add --pyspark_python to the cli tool (and deprecate the env var) than add a ton of logic to create environments for users.

          dan.blanchard Dan Blanchard added a comment -

          `conda list --export` will omit all pip-installed requirements and only export the conda ones. It is pretty common for people to have conda environments where they've installed things that weren't available through conda via pip.

          zjffdu Jeff Zhang added a comment -

          Thanks for the feedback Dan Blanchard. In my POC, I use "conda list --export" to create the requirements file, and it seems conda can consume this, although the format is a little different from virtualenv's.
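          Since the two formats differ, a thin conversion shim is easy to sketch. This is illustrative only (the helper name and exact handling are assumptions, not POC code): `conda list --export` emits `name=version=build` lines plus `#` comment lines, while pip expects `name==version`.

```python
def conda_export_to_pip(lines):
    """Hypothetical helper: turn 'conda list --export' lines into
    pip-style requirement strings."""
    reqs = []
    for line in lines:
        line = line.strip()
        # Skip blanks and comments such as "# platform: linux-64"
        if not line or line.startswith("#"):
            continue
        parts = line.split("=")  # e.g. "numpy=1.11.0=py27_0"
        if len(parts) >= 2:
            reqs.append("%s==%s" % (parts[0], parts[1]))
    return reqs
```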

          dan.blanchard Dan Blanchard added a comment -

          One thing to note is that conda doesn't use the same requirements file format as virtualenv/pip. You'll want to use conda env create, which uses its own separate YAML format. I have a pull request open to support requirements.txt files, but it has been waiting for action for some time.

          msukmanowsky Mike Sukmanowsky added a comment - - edited

          Perfect, and understood about not wanting to promote these to first-class citizens without wider feedback. At the least, I'd say both --py-files and --py-venv options could be supported if we're concerned about introducing a deprecation like this.

          zjffdu Jeff Zhang added a comment -

          spark.pyspark.virtualenv.requirements is a local file (which would be distributed to all nodes). Regarding upgrading these to first-class citizens, I would be conservative about that; it needs more feedback from other users.

          msukmanowsky Mike Sukmanowsky added a comment -

          One thought that just occurred to me: does spark.pyspark.virtualenv.requirements point to a path on the master node for a requirements file? It'd make sense if that were the case and the requirements file were then shipped to other nodes, instead of assuming that this file exists on all Spark nodes at the same location.

          Also, it might be a good idea to upgrade these to first-class citizens of spark-submit by supporting them as optional params instead of config properties. I'd go so far as to say it makes sense to deprecate --py-files in favour of:

          • --py-venv-type=conda
          • --py-venv-bin=/path/to/conda
          • --py-venv-requirements=/local/path/to/requirements.txt
          msukmanowsky Mike Sukmanowsky added a comment -

          Gotcha. I might suggest spark.pyspark.virtualenv.bin.path in that case.

          zjffdu Jeff Zhang added a comment -

          Thanks for your feedback Mike Sukmanowsky. spark.pyspark.virtualenv.path is not the path where the virtualenv is created; it is the path to the executable for virtualenv/conda which is used to create the virtualenv (I need to rename it to a more proper name to avoid confusion).
          In my POC, I create the virtualenv on all the executors, not only the driver. As you said, some python packages depend on C libraries, and we cannot guarantee they would work if we compiled them on the driver and distributed them to the other nodes.

          msukmanowsky Mike Sukmanowsky added a comment -

          Thanks for letting me know about this Jeff Zhang.

          I think in general, I'm +1 on the proposal.

          virtualenvs are the way to go to install requirements and ensure isolation of dependencies between multiple driver scripts. As you noted, though, installing hefty requirements like pandas or numpy (assuming you aren't using Conda) would add pretty significant startup overhead, which could be amortized if the driver is assumed to run for a long enough period of time. Conda, of course, would pretty well eliminate that problem, as it provides pre-compiled binaries for most OSs.

          I'd like to offer PEX as an alternative, where spark-submit would build a self-contained virtualenv in a .pex file on the Spark master node and then distribute it to all other nodes. However, it turns out PEX doesn't support editable requirements, and it introduces an assumption that all nodes in a cluster are homogeneous, so that a Python package with C extensions compiled on the master node would run on worker nodes without issue. The latter assumption may be a leap too far for all Spark users.

          One thing I'm not entirely sure of is the need for the spark.pyspark.virtualenv.path property. If the virtualenv is temporary, why would this path ever be specified? Wouldn't a temporary path be used and subsequently removed after the Python worker completes?

          srowen Sean Owen added a comment -

          CC Juliet Hougland for interest, comment

          zjffdu Jeff Zhang added a comment - - edited

          This method creates the virtualenv before the python worker starts, and the virtualenv is application-scoped: after the spark application finishes, the virtualenv is cleaned up. The virtualenvs don't need to be at the same path on each node (in my POC, it is the yarn container working directory). That means users don't need to manually install packages on each node (sometimes you can't even install packages on the cluster, for security reasons). This is the biggest benefit and purpose: users can create a virtualenv on demand without touching each node, even when they are not the administrator. The con is the extra cost of installing the required packages before starting the python worker, but for an application that runs for several hours the extra cost can be ignored.

          I have implemented a POC for this feature. Here's a simple example of how to use virtualenv in pyspark:

          bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  ~/work/virtualenv/spark.py
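
          For reference, the file passed via spark.pyspark.virtualenv.requirements (conda.txt above) would just be an ordinary requirements list. Its actual contents aren't shown in this thread, so the entries below are purely illustrative:

          ```
          numpy
          pandas
          ```

          Each node then installs these packages into its freshly created virtualenv before the Python worker starts.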
          

          There are 4 properties that need to be set:

          • spark.pyspark.virtualenv.enabled (flag to enable virtualenv)
          • spark.pyspark.virtualenv.type (native/conda are supported, default is native)
          • spark.pyspark.virtualenv.requirements (requirement file for the dependencies)
          • spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda, which is used for creating the virtualenv)

          Comments and feedback are welcome on how to improve this and whether it's valuable for users.
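
          To make the lifecycle concrete, here is a rough sketch in Python of what the per-node setup could look like for the "native" type: create a virtualenv in a temporary (e.g. container) working directory, then install the requirements file into it. This is an illustrative approximation, not the actual POC code; the function name and arguments are hypothetical.

          ```python
          import subprocess
          import sys
          import tempfile
          from pathlib import Path

          def setup_virtualenv(requirements_file=None, base_dir=None):
              # Application-scoped virtualenv: created in a throwaway working
              # directory (e.g. the YARN container dir), removed with it when
              # the application finishes.
              base = Path(base_dir or tempfile.mkdtemp(prefix="pyspark_venv_"))
              env_dir = base / "venv"
              # "native" type: python -m venv <dir>
              subprocess.check_call([sys.executable, "-m", "venv", str(env_dir)])
              bin_dir = env_dir / ("Scripts" if sys.platform == "win32" else "bin")
              if requirements_file:
                  # Install the user's dependencies into the fresh environment
                  # before the Python worker starts.
                  subprocess.check_call(
                      [str(bin_dir / "pip"), "install", "-r", requirements_file]
                  )
              return env_dir

          env_dir = setup_virtualenv()
          print("created:", env_dir)
          ```

          The extra cost Jeff mentions is exactly the `pip install` step here, which is paid once per application rather than once per cluster setup.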


            People

            • Assignee: Unassigned
            • Reporter: zjffdu Jeff Zhang
            • Votes: 22
            • Watchers: 40
