[SPARK-5361] Multiple Java RDD <-> Python RDD conversions not working correctly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.2, 1.3.0
    • Component/s: PySpark
    • Labels: None

    Description

      This was found while reading an RDD with `sc.newAPIHadoopRDD` and writing it back with `rdd.saveAsNewAPIHadoopFile` in PySpark.

      It turns out that whenever an RDD is converted from JavaRDD to PythonRDD and back to JavaRDD more than once, the following exception is thrown:

      15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
      java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
      	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
      	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
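
      The trace shows a hard cast to java.util.ArrayList failing because the value is an Object[]. The pure-Python sketch below illustrates that kind of mismatch. It is an analogy under stated assumptions, not Spark's implementation: `list` stands in for ArrayList, `tuple` stands in for Object[], the function names are hypothetical, and the idea that the container type differs between the first and second pass is an assumption made for illustration.

```python
# Toy model only: `list` stands in for java.util.ArrayList and `tuple`
# for Object[]. None of these functions are Spark APIs.


def strict_unbatch(batches):
    """Flatten batches, insisting (like a hard cast) that each batch is a list."""
    for batch in batches:
        if not isinstance(batch, list):
            raise TypeError("%s cannot be cast to list" % type(batch).__name__)
        for item in batch:
            yield item


def tolerant_unbatch(batches):
    """Flatten batches, accepting either container type."""
    for batch in batches:
        if not isinstance(batch, (list, tuple)):
            raise TypeError("unexpected batch type: %s" % type(batch).__name__)
        for item in batch:
            yield item


rows = [(u'2', {u'director': u'David Lean'}),
        (u'7', {u'director': u'Andrew Dominik'})]

first_pass = [list(rows)]    # assumed: batch arrives as a list, the cast succeeds
second_pass = [tuple(rows)]  # assumed: batch arrives array-like, the cast fails

assert list(strict_unbatch(first_pass)) == rows

try:
    list(strict_unbatch(second_pass))
    strict_failed = False
except TypeError:
    strict_failed = True

# Accepting both container types lets the second pass go through.
assert list(tolerant_unbatch(second_pass)) == rows
```

      In this sketch the strict version fails only on the second pass, which mirrors the symptom reported here; whether the actual fix took the tolerant approach is not stated in this report.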
      

      The test case code below reproduces it:

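      # Assumes a PySpark shell, where `sc` (the SparkContext) is already defined.
      from pyspark.rdd import RDD

      dl = [
          (u'2', {u'director': u'David Lean'}),
          (u'7', {u'director': u'Andrew Dominik'})
      ]

      # First round trip: Python RDD -> Java object RDD -> Python RDD.
      dl_rdd = sc.parallelize(dl)
      tmp = dl_rdd._to_java_object_rdd()
      tmp2 = sc._jvm.SerDe.javaToPython(tmp)
      t = RDD(tmp2, sc)
      t.count()  # succeeds

      # Second round trip on the RDD produced by the first one.
      tmp = t._to_java_object_rdd()
      tmp2 = sc._jvm.SerDe.javaToPython(tmp)
      t = RDD(tmp2, sc)
      t.count()  # fails here, on the second conversion, with the ClassCastException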
      

          People

            Assignee: Winston Chen (wingchen)
            Reporter: Winston Chen (wingchen)