Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: PySpark
    • Labels:
      None
    • Target Version/s:

      Description

      It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6.

      I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3

      I was able to use the futurize tool to handle the basic conversion of things like print statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was cloudpickle:

      [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
      Python 3.4.2 (default, Oct 19 2014, 17:52:17)
      [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      Traceback (most recent call last):
        File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module>
          import pyspark
        File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module>
          from pyspark.context import SparkContext
        File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module>
          from pyspark import accumulators
        File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module>
          from pyspark.cloudpickle import CloudPickler
        File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module>
          class CloudPickler(pickle.Pickler):
        File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler
          dispatch = pickle.Pickler.dispatch.copy()
      AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
      

      This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to Dill for Python serialization.
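
The AttributeError in the traceback above arises because, on Python 3, `pickle.Pickler` is normally the C-accelerated class from `_pickle`, which has no class-level `dispatch` dict, while the pure-Python `pickle._Pickler` still has one. Below is a minimal sketch of one possible workaround, assuming a subclass only needs the `dispatch` table. The names `pickler_base`, `SketchPickler`, and `dumps` are hypothetical, `_Pickler` is a private name that may change, and this is not necessarily how the issue was eventually fixed.

```python
import io
import pickle


def pickler_base():
    """Return a Pickler class that exposes a class-level `dispatch` dict."""
    # Python 2: pickle.Pickler is pure Python and has `dispatch`.
    if hasattr(pickle.Pickler, "dispatch"):
        return pickle.Pickler
    # Python 3: fall back to the pure-Python implementation (private name).
    return pickle._Pickler


Base = pickler_base()


class SketchPickler(Base):
    # Copy the table so that custom reducers added here do not leak
    # into the standard library's pickler.
    dispatch = Base.dispatch.copy()


def dumps(obj, protocol=2):
    """Serialize `obj` with SketchPickler and return the bytes."""
    buf = io.BytesIO()
    SketchPickler(buf, protocol).dump(obj)
    return buf.getvalue()
```

The trade-off, under these assumptions, is speed: the pure-Python pickler is slower than the C one, which may matter less for pickling closures than for pickling bulk data.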

          Activity

          twneale thom neale added a comment -

          Thank you for looking into this, Josh! Was able to get the module itself to import/run with some changes, but can't run the tests yet because I don't have spark built. But here's the pull request to your fork: https://github.com/JoshRosen/spark/pull/1. Some of these changes merit further investigation, but it's a start.

          joshrosen Josh Rosen added a comment -

          I've merged thom neale's PR, but there are still a bunch of things that don't actually run. It looks like this is going to be quite a bit of work. My {{python3}} branch is up-to-date, in case anyone wants to pick this up.

          matthewcornell Matthew Cornell added a comment -

          Please!!

          ianozsvald Ian Ozsvald added a comment -

          If I can cast a vote...

          I note that Python 2.6 is the lowest version of Python that's supported. Some recent data suggests that Python 2.6 support isn't so useful in the wider ecosystem and so might be slowing Spark development. A "Python 2 vs 3" survey was conducted before Christmas, and the results are recently in:
          http://www.randalolson.com/2015/01/30/python-usage-survey-2014/

          Of 6,746 respondents, less than 10% use Python 2.6 day-to-day. 81% use Python 2.7 (and 43% Python 3.4, including me) for day-to-day use (presumably for work), and there's an approximate 50/50 split between Python 2 & 3 for personal projects. I'd humbly suggest that supporting Python 2.6 will slow development and avoiding Python 3.4 will hinder wider adoption.

          The same survey a year back had 4,790 respondents, the second diagram on randalolson's site compares 2013 to 2014 - fewer people now are writing Python 2 day-to-day and more people are writing Python 3 (though Python 2.7 is still significantly dominant). Given that Python 2.7 will be deprecated by 2020 the trend to Python 3.4 is clear. Core scientific libraries (e.g. scipy, numpy, pandas, matplotlib) all work in Python 3.4 and have done for several years.

          The survey doesn't ask respondents whether they are web-devs, data scientists, ETL-folk, dev-ops etc so it is hard to extrapolate whether Spark-users are predominantly Python 2.6/2.7/3.4 but I'd suggest that a local survey in this community might provide useful guidance.

          Although it is on a longer cycle the major Linux distros like Ubuntu are switching away from Python 2.7 to Python 3+:
          https://www.archlinux.org/news/python-is-now-python-3/ # switched 2010
          http://www.phoronix.com/scan.php?page=news_item&px=Fedora-22-Python-3-Status # Fedora to Python 3 around May 2015
          https://wiki.ubuntu.com/Python/3 # work on-going, maybe the switch occurs in 2015?

          What is the use case for Python 2.6 support? Personally I'd vote for supporting 2.7 as a minimum with a strong push for Python 3.4 compatibility to reduce wasted hours supporting older Python versions. Supporting older Pythons will also hinder the creation of a Python 2.7/3.4 compatible code-base due to cross-language complications.

          About me - long-time speaker/teacher at Python conferences, O'Reilly author (High Performance Python), co-org of the 1000+ member PyDataLondon meetup and conference series, Python3.4 proponent since April 2014. At my PyData meetup I regularly query my usergroup (approx. 100 attendees each month), 1% use Python 2.6, the majority use Python 2.7, each month more people switch up to Python 3.4 (mainly to get away from unicode errors during text processing).

          joshrosen Josh Rosen added a comment -

          Hi Ian Ozsvald,

          Until now, the main motivation for Python 2.6 support was that it's the default system Python on a few Linux distributions. So far, I think the overhead of supporting 2.6 has been fairly minimal, mostly involving a handful of small changes such as not treating certain objects as context managers (e.g. ZipFile objects).
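
The ZipFile point above can be illustrated with a small sketch. `zipfile.ZipFile` only became usable in a `with` statement in Python 2.7, so 2.6-compatible code can wrap it in `contextlib.closing` or close it explicitly. The helper names `make_zip` and `read_member` are hypothetical and are not taken from the PySpark code base.

```python
import contextlib
import io
import zipfile


def make_zip(name, data):
    """Build an in-memory zip archive containing one member."""
    buf = io.BytesIO()
    zf = zipfile.ZipFile(buf, "w")
    try:
        # Explicit try/finally instead of `with zf:` keeps 2.6 happy.
        zf.writestr(name, data)
    finally:
        zf.close()
    return buf.getvalue()


def read_member(zip_bytes, name):
    """Read one member from an in-memory zip archive."""
    # contextlib.closing only needs a close() method, so it does not
    # rely on ZipFile itself being a context manager.
    with contextlib.closing(zipfile.ZipFile(io.BytesIO(zip_bytes))) as zf:
        return zf.read(name)
```

Both forms also run unchanged on Python 2.7 and 3.x, which is why this kind of workaround costs little to keep.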

          Let's try porting to 2.7 / 3.4 and then re-assess how hard Python 2.6 support will be. If it's really easy (a couple hours of work, max) then I don't see a reason to drop it, but if we have to go to increasingly convoluted lengths to keep it then it's probably not worth it if we're gaining 3.4 support in return.

          I think the main blocker to Python 3.4 support is the fact that nobody has really had time to work on it. I'd be happy to work with anyone who is interested in taking this on.

          twneale thom neale added a comment -

          I'm still very interested in helping with the 3.4 port, have only been
          prohibited by lack of free time. I'll ask if work will give me a half day
          to work on it.

          joshrosen Josh Rosen added a comment -

          By the way, it might be nice to see if we can figure out a good way of subdividing this task across multiple PRs so that the pieces that we have already figured out don't end up bitrotting / becoming merge-conflicts. For instance, if we can test the `cloudpickle.py` file separately from the other modules, then we could submit a PR that only adds 3.4 support to that file. If you can spot any other natural subproblems here, leave a comment or create a sub-task on this JIRA ticket.

          ianozsvald Ian Ozsvald added a comment -

          Hi Josh. After my post I went to do some digging, I believe (now) that Py2.6 support isn't hard relative to Py2.7 and that only Py2.5 makes it a real pain. Apologies for confusing the issue.

          I'll stand by my position that Py3.4 support is the way to go as the whole community is marching in that direction. I do think that the intersection of Python & Spark users will increasingly be around Py3.4+ over the coming year.

          Cheers, i.

          jimmyc Jimmy C added a comment -

          I'm very interested in using Spark in my projects, but the lack of Python 3 support unfortunately makes this very difficult. I hope this ticket can be prioritized.

          This has recently been brought up on Reddit as well https://www.reddit.com/r/Python/comments/2uz513/is_it_possible_to_use_apache_spark_with_python_3/

          ryanovas Ryan Ovas added a comment -

          I'm interested in using Spark in my startup, but everything we do is in Python 3.4 which makes adopting Spark difficult for me as well. I was surprised and disappointed (since I will have trouble using it myself) to see that there is no Python 3.x support when (as Ian Ozsvald suggested) the community as a whole is moving towards Python 3.4.

          apachespark Apache Spark added a comment -

          User 'davies' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5173

          joshrosen Josh Rosen added a comment -

          There's now an open pull request for this that is passing tests, and I'm beginning to review it: https://github.com/apache/spark/pull/5173. If anyone is interested in helping, it would be great to get more eyes on this PR.

          watsonix watson xi added a comment -

          Hi guys, what's the status of this project? I know a few people (including myself) who are ready to wave goodbye to Python 2 (it's been 6.5 years now!)... from an outside perspective looking in, Python 3 compatibility appears close!

          nchammas Nicholas Chammas added a comment -

          watson xi You can follow the active PR linked above in this JIRA issue.

          davies Davies Liu added a comment -

          That PR is pretty close to being merged; we are targeting this for the 1.4 release. It would be helpful if you guys could test this in your environments. Currently, it's only covered by unit tests.

          joshrosen Josh Rosen added a comment -

          Issue resolved by pull request 5173
          https://github.com/apache/spark/pull/5173


            People

            • Assignee:
              davies Davies Liu
              Reporter:
              joshrosen Josh Rosen
            • Votes:
              18
              Watchers:
              24
