  1. Spark
  2. SPARK-21753

Running the pi example with PyPy on Spark fails to serialize

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      I'm trying to run the pi example (https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py) on PySpark using PyPy 2.5.1, but everything I've tried results in a serialization error:

      Traceback (most recent call last):
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 147, in dump
      return Pickler.dump(self, obj)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 224, in dump
      self.save(obj)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 562, in save_tuple
      save(element)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 254, in save_function
      self.save_function_tuple(obj)
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 291, in save_function_tuple
      save((code, closure, base_globals))
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 548, in save_tuple
      save(element)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 600, in save_list
      self._batch_appends(iter(obj))
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 633, in _batch_appends
      save(x)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 254, in save_function
      self.save_function_tuple(obj)
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 291, in save_function_tuple
      save((code, closure, base_globals))
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 548, in save_tuple
      save(element)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 600, in save_list
      self._batch_appends(iter(obj))
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 636, in _batch_appends
      save(tmp[0])
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 248, in save_function
      self.save_function_tuple(obj)
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 296, in save_function_tuple
      save(f_globals)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 653, in save_dict
      self._batch_setitems(obj.iteritems())
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 690, in _batch_setitems
      save(v)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 447, in save_instancemethod
      obj=obj)
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 581, in save_reduce
      save(args)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 548, in save_tuple
      save(element)
      File "//home/tgraves/pypy-my-own-package-name/lib-python/2.7/pickle.py", line 286, in save
      f(self, obj) # Call unbound method with explicit self
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 246, in save_function
      if islambda(obj) or obj.__code__.co_filename == '<stdin>' or themodule is None:
      AttributeError: 'builtin-code' object has no attribute 'co_filename'
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/tgraves/y-spark-git/python/pyspark/rdd.py", line 834, in reduce
      vals = self.mapPartitions(func).collect()
      File "/home/tgraves/y-spark-git/python/pyspark/rdd.py", line 808, in collect
      port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
      File "/home/tgraves/y-spark-git/python/pyspark/rdd.py", line 2440, in _jrdd
      self._jrdd_deserializer, profiler)
      File "/home/tgraves/y-spark-git/python/pyspark/rdd.py", line 2373, in _wrap_function
      pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
      File "/home/tgraves/y-spark-git/python/pyspark/rdd.py", line 2359, in _prepare_for_python_RDD
      pickled_command = ser.dumps(command)
      File "/home/tgraves/y-spark-git/python/pyspark/serializers.py", line 460, in dumps
      return cloudpickle.dumps(obj, 2)
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 703, in dumps
      cp.dump(obj)
      File "/home/tgraves/y-spark-git/python/pyspark/cloudpickle.py", line 160, in dump
      raise pickle.PicklingError(msg)

      It looks like the issue is with serializing random(). If you remove random() from the function then everything works fine.
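
      The AttributeError in the traceback comes from reading co_filename on the code object of a builtin. As a minimal sketch of a defensive check, assuming the failing expression is the one shown at cloudpickle.py line 246, the lookup could fall back to None instead of raising. The helpers safe_co_filename and defined_interactively are hypothetical names used for illustration only; they are not part of cloudpickle or PySpark, and this is not necessarily how the issue was fixed.

```python
import random


def safe_co_filename(func):
    """Return func's source filename, or None when it is unavailable."""
    # CPython builtins have no __code__ at all; PyPy builtins expose a
    # 'builtin-code' object that lacks co_filename (per the traceback).
    code = getattr(func, "__code__", None)
    return getattr(code, "co_filename", None)


def defined_interactively(func):
    """Hypothetical stand-in for the '<stdin>' test in save_function."""
    return safe_co_filename(func) == "<stdin>"


# A builtin such as random.random is handled without an AttributeError.
print(defined_interactively(random.random))
```

      With a guard like this, a function that references random() would be classified without touching a missing attribute, on either interpreter.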

      I'm just running PYSPARK_PYTHON=//home/tgraves/pypy-my-own-package-name/bin/pypy ./bin/pyspark

      I've tried multiple versions of PyPy, from 2.5.1 to 5.8.0. I tried the portable version and also built PyPy from source.

      If it works for others, perhaps I have a setup issue. Any hints on that would be appreciated.

              People

              • Assignee: rgbkrk Kyle Kelley
              • Reporter: tgraves Thomas Graves