Spark / SPARK-19019

PySpark does not work with Python 3.6.0

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      Currently, PySpark does not work with Python 3.6.0.

      Running ./bin/pyspark simply throws the error as below:

      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      

      The problem is in https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 as the error says, and the cause seems to be that the optional arguments of namedtuple became completely keyword-only as of Python 3.6.0 (see https://bugs.python.org/issue25628).
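      A quick way to confirm the keyword-only change (note that on Python 3.7+ the signature differs again, since `verbose` was removed there; the positional call below fails on any 3.6+ interpreter):

```python
import collections
import inspect

# On Python 3.6+ the optional namedtuple() arguments are keyword-only,
# so passing one of them positionally raises a TypeError.
print(inspect.signature(collections.namedtuple))

try:
    collections.namedtuple("Point", "x y", True)  # third positional arg rejected
except TypeError as e:
    print("TypeError:", e)
```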

      We currently copy this function via types.FunctionType, which does not carry over the default values of keyword-only arguments (namedtuple.__kwdefaults__), and this seems to leave those arguments unbound in the copied function.

      This ends up as below:

      import types
      import collections
      
      def _copy_func(f):
          return types.FunctionType(f.__code__, f.__globals__, f.__name__,
              f.__defaults__, f.__closure__)
      
      _old_namedtuple = _copy_func(collections.namedtuple)
      

      If we call as below:

      >>> _old_namedtuple("a", "b")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      

      It throws an exception as above because __kwdefaults__ for the required keyword-only arguments is unset in the copied function. So, if we give explicit values for these,

      >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
      <class '__main__.a'>
      

      It works fine.

      It seems we should now properly set these defaults on the hijacked (copied) function.
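      A minimal sketch of the problem and the fix, using a toy function rather than namedtuple itself (the real copy lives in pyspark/serializers.py): copying a function through types.FunctionType drops __kwdefaults__, and restoring it afterwards makes the copy callable again.

```python
import types

# A function with keyword-only arguments that have defaults,
# mirroring Python 3.6's collections.namedtuple signature.
def greet(name, *, greeting="hello", punctuation="!"):
    return f"{greeting} {name}{punctuation}"

def _copy_func(f):
    # Same copy strategy as pyspark/serializers.py: FunctionType
    # carries over __defaults__ but NOT __kwdefaults__.
    return types.FunctionType(f.__code__, f.__globals__, f.__name__,
                              f.__defaults__, f.__closure__)

broken = _copy_func(greet)
print(broken.__kwdefaults__)   # None: keyword-only defaults were lost
try:
    broken("world")
except TypeError as e:
    print(e)  # missing required keyword-only arguments

# The fix: copy __kwdefaults__ onto the new function as well.
fixed = _copy_func(greet)
fixed.__kwdefaults__ = greet.__kwdefaults__
print(fixed("world"))  # hello world!
```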

            People

            • Assignee: hyukjin.kwon (Hyukjin Kwon)
            • Reporter: hyukjin.kwon (Hyukjin Kwon)
            • Votes: 4
            • Watchers: 13
