[SPARK-19019] PySpark does not work with Python 3.6.0

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      Currently, PySpark does not work with Python 3.6.0.

      Running ./bin/pyspark simply throws the error as below:

      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      

      The problem is in https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394, as the error says, and the cause seems to be that the verbose, rename, and module arguments of namedtuple became keyword-only as of Python 3.6.0 (see https://bugs.python.org/issue25628).
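
      As a quick check (a hypothetical snippet, not from the report), the new signature can be inspected directly:

      import collections
      import inspect

      # On Python 3.6.0 this prints:
      # (typename, field_names, *, verbose=False, rename=False, module=None)
      print(inspect.signature(collections.namedtuple))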

      We currently copy this function via types.FunctionType, which does not set the default values of keyword-only arguments (namedtuple.__kwdefaults__), and this seems to leave those arguments unbound inside the copied function.

      This ends up as below:

      import types
      import collections
      
      def _copy_func(f):
          return types.FunctionType(f.__code__, f.__globals__, f.__name__,
              f.__defaults__, f.__closure__)
      
      _old_namedtuple = _copy_func(collections.namedtuple)
      
      _old_namedtuple("a", "b")
      

      If we call as below:

      >>> _old_namedtuple("a", "b")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      

      It throws an exception as above because __kwdefaults__ for the keyword-only arguments is unset in the copied function. So, if we give explicit values for these:

      >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
      <class '__main__.a'>
      

      It works fine.

      It seems we should now properly set these on the hijacked function.
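
      For illustration, a minimal sketch of that idea (a hypothetical helper name, not the exact patch in the PR): copy the function and carry __kwdefaults__ over as well.

      import types
      import collections

      def _copy_func_with_kwdefaults(f):
          # Copy the function object, then restore the keyword-only
          # defaults, which types.FunctionType does not carry over.
          fn = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                                  f.__defaults__, f.__closure__)
          fn.__kwdefaults__ = f.__kwdefaults__
          return fn

      _old_namedtuple = _copy_func_with_kwdefaults(collections.namedtuple)
      print(_old_namedtuple("a", "b"))  # <class '__main__.a'>, even on Python 3.6.0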


          Activity

          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/16429

          davies Davies Liu added a comment -

          Issue resolved by pull request 16429
          https://github.com/apache/spark/pull/16429

          zero323 Maciej Szymkiewicz added a comment -

          Davies Liu Could it be backported to 1.6 and 2.0?

          henrytxz Henry Zhang added a comment - edited

          Would also be interested in the answer to Maciej's question (for 2.0). Also, when is 2.1.1 scheduled to be released? Thank you!

          hyukjin.kwon Hyukjin Kwon added a comment -

          Let me try to make a PR to backport this if this is confirmed.

          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17374

          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17375

          hyukjin.kwon Hyukjin Kwon added a comment -

          To solve this problem fully, I had to port a cloudpickle change in the PR too. Only fixing the hijacked function described above does not fully solve this issue. Please refer to the discussion and the change in the PR.

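
          (For context, a hypothetical illustration, not the actual cloudpickle code: cloudpickle also rebuilds functions with types.FunctionType, so a reconstructed function loses its keyword-only defaults unless they are restored explicitly.)

          import types

          def f(a, *, flag=True):
              return (a, flag)

          # Rebuild f the way a naive function copy/deserialization would.
          g = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                                 f.__defaults__, f.__closure__)
          try:
              g(1)
          except TypeError as e:
              print(e)  # missing 1 required keyword-only argument: 'flag'

          g.__kwdefaults__ = f.__kwdefaults__
          print(g(1))  # (1, True) once the defaults are restored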
          MrMathias Mathias M. Andersen added a comment - edited

          Just got this error post-fix on Spark 2.1:

          Traceback (most recent call last):
              File "/opt/anaconda3/lib/python3.6/runpy.py", line 183, in _run_module_as_main
                mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
              File "/opt/anaconda3/lib/python3.6/runpy.py", line 109, in _get_module_details
                __import__(pkg_name)
              File "/usr/hdp/current/spark-client/python/pyspark/__init__.py", line 41, in <module>
                from pyspark.context import SparkContext
              File "/usr/hdp/current/spark-client/python/pyspark/context.py", line 33, in <module>
                from pyspark.java_gateway import launch_gateway
              File "/usr/hdp/current/spark-client/python/pyspark/java_gateway.py", line 25, in <module>
                import platform
              File "/opt/anaconda3/lib/python3.6/platform.py", line 886, in <module>
                "system node release version machine processor")
              File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 381, in namedtuple
                cls = _old_namedtuple(*args, **kwargs)
            TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
          
          hyukjin.kwon Hyukjin Kwon added a comment -

          I think this was backported into Spark 2.1.1. Was your Spark version 2.1.1+?

          MrMathias Mathias M. Andersen added a comment -

          Yeah, this was just a PYTHONPATH mishap on our end. 2.1.1 is a-okay.


            People

            • Assignee: hyukjin.kwon Hyukjin Kwon
            • Reporter: hyukjin.kwon Hyukjin Kwon
            • Votes: 4
            • Watchers: 13
