I'm all for a pip-installable pyspark, but I'm unsure about the right way to install the pyspark code, and I'd prefer to avoid introducing an extra variable, SPARK_VERSION. If we had a typical setup.py that downloaded the code from PyPI, users could end up with a mismatch between the pyspark version on PyPI and the version under SPARK_HOME. They would also still need to download the Spark jars or set SPARK_HOME, which means two (possibly different) copies of the Python code are floating around. Making users manage the version, download Spark into SPARK_HOME, and pip install pyspark doesn't seem quite right.
What do you think about this: we create a setup.py that requires SPARK_HOME to be set in the environment (i.e., the user has already downloaded Spark) BEFORE the pyspark code gets installed.
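A minimal sketch of what that guard could look like in setup.py (the helper name require_spark_home is hypothetical, not anything Spark ships today):

```python
import os


def require_spark_home():
    """Fail fast unless SPARK_HOME points at an existing Spark download,
    so pyspark is only ever installed against a local Spark."""
    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home or not os.path.isdir(spark_home):
        raise RuntimeError(
            "SPARK_HOME must point to an existing Spark download "
            "before installing pyspark."
        )
    return spark_home


# In setup.py the check would run before setup() is called, e.g.:
# spark_home = require_spark_home()
# setup(name="pyspark", ...)
```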
An additional idea we could consider: when pip or a user runs "python setup.py install", we redirect it to "python setup.py develop". This installs pyspark in "development mode", which means the pyspark code under $SPARK_HOME/python remains the source of truth (more about development mode here: https://pythonhosted.org/setuptools/setuptools.html#development-mode). My thinking is that since users need to set SPARK_HOME anyway, we might as well keep the Python library with the Spark code (as it currently is) to avoid potential compatibility conflicts. As maintainers, we also wouldn't need to keep PyPI updated with the latest version of pyspark. That said, using "develop mode" as the default may be a bad idea, and I also don't know offhand how to automatically prefer "setup.py develop" over "setup.py install".
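One way the redirection might work is a setuptools cmdclass override; this is a sketch of the idea, not a tested recommendation (and whether it's wise as a default is exactly the open question above):

```python
from setuptools.command.install import install


class InstallAsDevelop(install):
    """Redirect `python setup.py install` to `python setup.py develop`,
    so $SPARK_HOME/python stays the source of truth."""

    def run(self):
        # run_command dispatches to the registered "develop" command
        # instead of performing a normal copy-into-site-packages install.
        self.run_command("develop")


# setup.py would register the override via cmdclass, e.g.:
# setup(name="pyspark", cmdclass={"install": InstallAsDevelop}, ...)
```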
Lastly, and perhaps most obviously, if we create a setup.py we could probably stop bundling the py4j egg in the Spark downloads, and instead rely on setuptools to provide the external libraries.
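Declaring py4j as an ordinary dependency would then be a one-line addition to setup.py; a sketch (the arguments shown are illustrative, and any real setup.py would pin a specific py4j version):

```python
# Sketch of the relevant setup() arguments. The unpinned "py4j"
# specifier is illustrative; a real setup.py would pin the version
# that the bundled Spark jars expect.
setup_kwargs = {
    "name": "pyspark",
    "install_requires": ["py4j"],  # fetched from PyPI instead of a bundled egg
}

# The real setup.py would then invoke:
# setup(**setup_kwargs)
```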