Currently, it's not easy for users to add third-party Python packages in PySpark.
- One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies); see the sketch after this list.
- Another way is to install packages manually on each node (time-consuming, and it is not easy to switch between different environments).
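
As a minimal sketch of the --py-files approach (the package, archive, and script names are illustrative):

    # Bundle pure-Python dependencies into a zip and ship it with the job.
    # mypackage/, deps.zip, and app.py are hypothetical names for illustration.
    zip -r deps.zip mypackage/
    spark-submit --py-files deps.zip app.py

This works as long as every dependency (including its transitive dependencies) can be zipped up by hand; packages with deep dependency trees or native extensions quickly make it impractical.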
Python now has two different virtualenv implementations: the native virtualenv, and conda. This JIRA is about bringing these two tools to the distributed environment.
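
For reference, here is a minimal sketch of what each tool does on a single machine (the environment names and requirements.txt are illustrative); the goal here is to have Spark perform equivalent steps on each node automatically, instead of an admin doing this by hand everywhere:

    # Native virtualenv: create an isolated environment and install pinned deps.
    virtualenv myenv
    myenv/bin/pip install -r requirements.txt

    # Conda: create an environment with the requested packages
    # (conda can also pull in non-Python native dependencies).
    conda create -y -n myenv python numpy

A hypothetical sketch of how this could surface to users — the spark.pyspark.virtualenv.* property names below are assumptions for illustration, not a settled API:

    # Hypothetical usage: ask Spark to build a per-executor virtualenv from a
    # requirements file at startup. All property names here are illustrative.
    spark-submit \
      --conf spark.pyspark.virtualenv.enabled=true \
      --conf spark.pyspark.virtualenv.type=native \
      --conf spark.pyspark.virtualenv.requirements=requirements.txt \
      app.py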