Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-5361

Multiple Java RDD <-> Python RDD conversions not working correctly

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.2.0
    • 1.2.2, 1.3.0
    • PySpark
    • None

    Description

      This is found through reading RDD from `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in pyspark.

      It turns out that whenever there are multiple RDD conversions from JavaRDD to PythonRDD then back to JavaRDD, the exception below happens:

      15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
      java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
      	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
      	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
      

      The test case code below reproduces it:

      from pyspark.rdd import RDD
      
      dl = [
          (u'2', {u'director': u'David Lean'}), 
          (u'7', {u'director': u'Andrew Dominik'})
      ]
      
      dl_rdd = sc.parallelize(dl)
      tmp = dl_rdd._to_java_object_rdd()
      tmp2 = sc._jvm.SerDe.javaToPython(tmp)
      t = RDD(tmp2, sc)
      t.count()
      
      tmp = t._to_java_object_rdd()
      tmp2 = sc._jvm.SerDe.javaToPython(tmp)
      t = RDD(tmp2, sc)
      t.count() # it blows up here during the 2nd time of conversion
      

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            wingchen Winston Chen
            wingchen Winston Chen
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment