  Spark / SPARK-19019

PySpark does not work with Python 3.6.0

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      Currently, PySpark does not work with Python 3.6.0.

      Running ./bin/pyspark simply throws the error as below:

      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      

      The problem is in https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394, as the error says: the arguments of namedtuple became keyword-only as of Python 3.6.0 (see https://bugs.python.org/issue25628).

      We currently copy this function via types.FunctionType, which does not set the default values of keyword-only arguments (namedtuple.__kwdefaults__). This seems to leave those arguments unbound in the copied function.
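      That types.FunctionType drops keyword-only defaults can be reproduced with any function that has them; a minimal illustration of the mechanism (not Spark code, just a sketch):

      ```python
      import types

      def f(a, *, b=1, c=2):
          # A function with keyword-only arguments that have defaults.
          return a + b + c

      # Copy it the same way serializers.py copies namedtuple. Note that the
      # fourth argument of types.FunctionType is __defaults__ (the defaults of
      # positional arguments), so __kwdefaults__ is silently lost on the copy.
      g = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                             f.__defaults__, f.__closure__)

      print(f.__kwdefaults__)  # {'b': 1, 'c': 2}
      print(g.__kwdefaults__)  # None: calling g(0) now raises TypeError
      ```

      With __kwdefaults__ gone, the copy treats b and c as required keyword-only arguments, which is exactly the error the traceback above reports for namedtuple.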

      This ends up as below:

      import types
      import collections
      
      def _copy_func(f):
          return types.FunctionType(f.__code__, f.__globals__, f.__name__,
              f.__defaults__, f.__closure__)
      
      _old_namedtuple = _copy_func(collections.namedtuple)
      
      

      If we call as below:

      >>> _old_namedtuple("a", "b")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      

      It throws an exception as above because __kwdefaults__ for the required keyword-only arguments is unset in the copied function. So, if we give explicit values for these,

      >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
      <class '__main__.a'>
      

      It works fine.

      It seems we should now set these defaults properly on the hijacked function.
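      One way to do that is to carry __kwdefaults__ over when copying. This is a minimal sketch of the idea; the actual fix merged in the pull request may differ:

      ```python
      import types
      import collections

      def _copy_func(f):
          # Copy the function as before, but also carry over the defaults of
          # keyword-only arguments, which types.FunctionType does not preserve.
          fn = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                                  f.__defaults__, f.__closure__)
          fn.__kwdefaults__ = f.__kwdefaults__
          return fn

      _old_namedtuple = _copy_func(collections.namedtuple)

      # The copy can now be called without spelling out the keyword-only arguments.
      Point = _old_namedtuple("Point", ["x", "y"])
      ```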

        Issue Links

          Activity

          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/16429

          davies Davies Liu added a comment -

          Issue resolved by pull request 16429
          https://github.com/apache/spark/pull/16429

          zero323 Maciej Szymkiewicz added a comment -

          Davies Liu Could it be backported to 1.6 and 2.0?

          henrytxz Henry Zhang added a comment - edited

          Would also be interested in the answer to Maciej's question (for 2.0). Also, when is 2.1.1 scheduled to be released? Thank you!

          hyukjin.kwon Hyukjin Kwon added a comment -

          Let me try to make a PR to backport this if this is confirmed.

          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17374

          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17375

          hyukjin.kwon Hyukjin Kwon added a comment -

          To solve this problem fully, I had to port a cloudpickle change in the PR as well. Only fixing the hijacked function described above does not fully solve this issue. Please refer to the discussion in the PR and the change.


            People

            • Assignee: hyukjin.kwon Hyukjin Kwon
            • Reporter: hyukjin.kwon Hyukjin Kwon
            • Votes: 4
            • Watchers: 11
