Currently, it's not easy for users to add third-party Python packages in PySpark.
- One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies)
- Another way is to install packages manually on each node (time-consuming, and hard when switching between different environments)
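As a minimal sketch of the --py-files route, a pure-Python package can be zipped and shipped alongside the job; spark-submit adds the archive to each executor's sys.path. The package name mypkg and the script name my_job.py below are hypothetical placeholders:

```python
import os
import zipfile

# Hypothetical pure-Python package layout:
#   mypkg/__init__.py
#   mypkg/util.py
os.makedirs("mypkg", exist_ok=True)
with open("mypkg/__init__.py", "w") as f:
    f.write("")
with open("mypkg/util.py", "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

# Zip the package so it can be passed via --py-files.
with zipfile.ZipFile("mypkg.zip", "w") as zf:
    for name in ("mypkg/__init__.py", "mypkg/util.py"):
        zf.write(name)

# Usage (hypothetical script name):
#   spark-submit --py-files mypkg.zip my_job.py
```

Note this only works for pure-Python code; packages with native extensions or transitive dependencies are exactly where this approach breaks down.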
Python now has 2 different virtualenv implementations: one is the native virtualenv, and the other is conda. This jira is trying to bring these 2 tools to the distributed environment.
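A sketch of how job submission might look under this proposal. The spark.pyspark.virtualenv.* property names follow the design discussion on this jira and are not part of a released Spark version; the paths, master/deploy-mode choice, and script name are placeholders:

```shell
# Each executor creates its own virtualenv (native or conda) and
# installs the packages listed in the requirements file before
# launching the Python worker.
spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=conda" \
  --conf "spark.pyspark.virtualenv.requirements=/path/to/requirements.txt" \
  --conf "spark.pyspark.virtualenv.bin.path=/path/to/conda" \
  my_job.py
```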
||Summary||Status||Assignee||
|virtualenv example does not work in yarn cluster mode|Resolved|Unassigned|
|Kmeans.py application fails with virtualenv and due to parse error|Resolved|Unassigned|
|virtualenv example failed with conda due to ImportError: No module named ruamel.yaml.comments|Resolved|Unassigned|