Spark / SPARK-1034

Py4JException on PySpark Cartesian Result


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.9.0
    • Fix Version: 0.9.0
    • Component: PySpark
    • Labels: None

    Description

      RDD operations on the result of PySpark's cartesian() method raise a Py4JException.

      Here are a few examples:

      $ bin/pyspark
      >>> rdd1=sc.parallelize([1,2,3,4,5,1])
      >>> rdd2=sc.parallelize([11,12,13,14,15,11])
      >>> rdd1.cartesian(rdd2).map(lambda x: x[0] + x[1]).collect()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/pyspark/rdd.py", line 446, in collect
          bytesInJava = self._jrdd.collect().iterator()
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/pyspark/rdd.py", line 1041, in _jrdd
          class_tag)
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 304, in get_return_value
      py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
      py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.CartesianRDD, class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.Boolean, class java.lang.String, class java.util.ArrayList, class org.apache.spark.Accumulator, class scala.reflect.ManifestFactory$$anon$2]) does not exist
      	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:184)
      	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
      	at py4j.Gateway.invoke(Gateway.java:213)
      	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
      	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
      	at py4j.GatewayConnection.run(GatewayConnection.java:207)
      	at java.lang.Thread.run(Thread.java:744)
      
      
      >>> 
      >>> rdd1.cartesian(rdd2).count()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/pyspark/rdd.py", line 525, in count
          return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/pyspark/rdd.py", line 516, in sum
          return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/pyspark/rdd.py", line 482, in reduce
          vals = self.mapPartitions(func).collect()
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/pyspark/rdd.py", line 446, in collect
          bytesInJava = self._jrdd.collect().iterator()
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/pyspark/rdd.py", line 1041, in _jrdd
          class_tag)
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
        File "/Users/tshinagawa/Documents/Spark/RCs/spark-0.9.0-incubating-rc2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 304, in get_return_value
      py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
      py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.CartesianRDD, class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.Boolean, class java.lang.String, class java.util.ArrayList, class org.apache.spark.Accumulator, class scala.reflect.ManifestFactory$$anon$2]) does not exist
      	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:184)
      	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
      	at py4j.Gateway.invoke(Gateway.java:213)
      	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
      	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
      	at py4j.GatewayConnection.run(GatewayConnection.java:207)
      	at java.lang.Thread.run(Thread.java:744)
      

      This issue started appearing after the custom serializer change:
      https://github.com/apache/incubator-spark/commit/cbb7f04aef2220ece93dea9f3fa98b5db5f270d6
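
      For reference, the values the failing calls should produce can be computed locally in plain Python (no Spark needed; on these small inputs, itertools.product mirrors the semantics of RDD.cartesian):

      ```python
      from itertools import product

      data1 = [1, 2, 3, 4, 5, 1]
      data2 = [11, 12, 13, 14, 15, 11]

      # Equivalent of rdd1.cartesian(rdd2).map(lambda x: x[0] + x[1]).collect()
      sums = [x + y for x, y in product(data1, data2)]
      print(sums[:6])   # first "row" of the cross product: [12, 13, 14, 15, 16, 12]

      # Equivalent of rdd1.cartesian(rdd2).count()
      print(len(sums))  # 36
      ```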

      People

        Assignee: joshrosen Josh Rosen
        Reporter: mrt Taka Shinagawa
        Votes: 1
        Watchers: 3
