Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0, 2.2.0
    • Component/s: PySpark
    • Labels:


        Activity

        prabinb Prabin Banka added a comment -

        We can write a simple setup.py file for the pyspark source distribution.
        Any end user who intends to use the pyspark modules would need to pip install pyspark and set the SPARK_HOME env variable before importing pyspark into their code.
        Also, we could introduce one more environment variable, say SPARK_VERSION, which would be validated against the installed pyspark version at import time. A dictionary could be maintained in a text file under spark/python to validate the compatibility of pyspark and spark.

        Will this be sufficient?
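
        A minimal, hypothetical sketch of the import-time check described above; the SPARK_VERSION variable, the compatibility file name, and its location under spark/python are assumptions taken from this comment, not an existing API:

        import os

        # Hypothetical version baked into the pip-installed pyspark package.
        PYSPARK_VERSION = "1.0.0"

        def _validate_spark_env():
            spark_home = os.environ.get("SPARK_HOME")
            if spark_home is None:
                raise ImportError("SPARK_HOME must be set before importing pyspark")

            spark_version = os.environ.get("SPARK_VERSION")
            if spark_version is None:
                return  # nothing to validate against

            # Assumed compatibility map kept as a text file under spark/python:
            # each line is "<pyspark version> <comma-separated supported Spark versions>".
            compat_file = os.path.join(spark_home, "python", "pyspark_compat.txt")
            with open(compat_file) as f:
                compat = dict(line.split(None, 1) for line in f if line.strip())

            supported = compat.get(PYSPARK_VERSION, "").strip()
            if spark_version not in supported.split(","):
                raise ImportError("pyspark %s is not known to work with Spark %s"
                                  % (PYSPARK_VERSION, spark_version))

        _validate_spark_env()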

        prabinb Prabin Banka added a comment -

        @Josh, please comment.

        adgaudio Alex Gaudio added a comment -

        I'm all for pip installable pyspark, but I'm confused about the ideal way to install the pyspark code. I'd also prefer to avoid introducing an extra variable, SPARK_VERSION. It seems to me that if we had a typical setup.py file that downloaded the code from PyPI, users would have to deal with differences between the Python code published on PyPI and the code pointed to by SPARK_HOME. Additionally, users would still need to download the Spark jars or set SPARK_HOME, which means two (possibly different) versions of the python code are flying around. The fact that users have to manage the version, download Spark into SPARK_HOME, and pip install pyspark doesn't seem quite right.

        What do you think about this: We create a setup.py file that requires SPARK_HOME be set in the environment (requiring that the user have downloaded Spark) BEFORE the pyspark code gets installed.

        An additional idea we could consider: when pip or a user installs pyspark, we have "python setup.py install" redirect to "python setup.py develop". This installs pyspark in "development mode" and means that the pyspark code pointed to by $SPARK_HOME/python is the source of truth (more about development mode here: https://pythonhosted.org/setuptools/setuptools.html#development-mode). My thinking is that since users need to specify SPARK_HOME, we might as well keep the python library with the spark code (as it currently is) to avoid potential compatibility conflicts. As maintainers, we also wouldn't need to update PyPI with the latest version of pyspark. Using "develop" mode as the default may be a bad idea, though, and I also don't know how to automatically prefer "setup.py develop" over "setup.py install".

        Last, and perhaps most obviously, if we create a setup.py file we could probably stop including the py4j egg in the Spark downloads, since we'd rely on setuptools to provide the external libraries.
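
        A rough sketch of the SPARK_HOME-gated setup.py idea described above, assuming the file lives at the root of a downloaded Spark distribution and simply fails fast when the variable is missing; the package metadata is a placeholder, not the project's actual metadata:

        import os
        import sys
        from setuptools import setup

        # Require that Spark is already downloaded and SPARK_HOME points at it
        # before the Python package is installed.
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home or not os.path.isdir(os.path.join(spark_home, "python", "pyspark")):
            sys.exit("Set SPARK_HOME to an existing Spark download before installing pyspark")

        setup(
            name="pyspark",             # placeholder metadata for illustration only
            version="0.0.0",
            packages=["pyspark"],
            package_dir={"pyspark": "python/pyspark"},
            install_requires=["py4j"],  # setuptools pulls py4j instead of bundling the egg
        )

        Running "pip install -e ." from the Spark root would then give the develop-mode behaviour mentioned above, where $SPARK_HOME/python stays the single source of truth and edits there are picked up without reinstalling.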

        nrchandan Chandan Kumar added a comment -

        Alex Gaudio I had similar reservations about the approach. I will try to investigate the possibility of using 'develop' mode.

        davies Davies Liu added a comment -

        Because PySpark depends on the Spark packages, a Python user cannot actually use it after just 'pip install pyspark', so there is not much benefit from this.

        Once we release PySpark separately from Spark, we would have to keep compatibility across versions of PySpark and Spark, which would be a nightmare for us (we could not move fast to improve the implementation of PySpark).

        So, I think we cannot do this in the near future. Prabin Banka, do you mind closing the PR?

        prabinb Prabin Banka added a comment -

        Closing this PR for now.

        apachespark Apache Spark added a comment -

        User 'alope107' has created a pull request for this issue:
        https://github.com/apache/spark/pull/8318

        holdenk holdenk added a comment -

        Re-opening after discussion on the mailing list and the PR thread.

        apachespark Apache Spark added a comment -

        User 'holdenk' has created a pull request for this issue:
        https://github.com/apache/spark/pull/15659

        joshrosen Josh Rosen added a comment -

        Merged into master (2.2) and will consider for 2.1.


          People

          • Assignee:
            holdenk holdenk
            Reporter:
            prabinb Prabin Banka
          • Votes:
            0
            Watchers:
            13
