Spark / SPARK-19019

PySpark does not work with Python 3.6.0



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0
    • Component/s: PySpark
    • Labels: None


      Currently, PySpark does not work with Python 3.6.0.

      Running ./bin/pyspark fails with the following error:

      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

      The problem is in https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394, as the traceback indicates. The cause seems to be that the verbose, rename, and module arguments of namedtuple are keyword-only as of Python 3.6.0 (see https://bugs.python.org/issue25628).

      We currently copy this function via types.FunctionType, which does not set the default values of keyword-only arguments (namedtuple.__kwdefaults__). As a result, the copied function seems to be left with keyword-only arguments that have no default values (unbound arguments).

      The copy is made as below:

      import types
      import collections

      def _copy_func(f):
          # Copies code, globals, name, positional defaults, and closure,
          # but not the keyword-only defaults (f.__kwdefaults__).
          return types.FunctionType(f.__code__, f.__globals__, f.__name__,
              f.__defaults__, f.__closure__)

      _old_namedtuple = _copy_func(collections.namedtuple)

      If we call as below:

      >>> _old_namedtuple("a", "b")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

      It throws the exception above because __kwdefaults__ seems to be unset in the copied function, which makes the keyword-only arguments required. So, if we give explicit values for these,

      >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
      <class '__main__.a'>

      It works fine.
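The missing defaults can also be checked directly by comparing __kwdefaults__ on the original function and on the copy. The following is a minimal sketch, assuming Python 3.6 or later; the names copied, original_kwdefaults, and copied_kwdefaults are illustrative only and do not appear in PySpark.

```python
import collections
import types


def _copy_func(f):
    # Same copy as in the description: keyword-only defaults are not passed.
    return types.FunctionType(f.__code__, f.__globals__, f.__name__,
                              f.__defaults__, f.__closure__)


copied = _copy_func(collections.namedtuple)

# Keyword-only defaults of the original function (a dict on Python 3.6+).
original_kwdefaults = collections.namedtuple.__kwdefaults__
# Keyword-only defaults of the copy, which were never carried over.
copied_kwdefaults = copied.__kwdefaults__

print("original:", original_kwdefaults)
print("copy:    ", copied_kwdefaults)
```

The exact contents of the original dict depend on the Python version, but the copy is expected to report None.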

      It seems we should now properly set these defaults on the hijacked one.





            gurwls223 Hyukjin Kwon



