
[SPARK-4882] PySpark broadcast breaks when using KryoSerializer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.3.0
    • Fix Version/s: 1.2.1, 1.3.0
    • Component/s: PySpark
    • Labels: None

    Description

      When KryoSerializer is used, PySpark throws a NullPointerException when trying to send broadcast variables to workers. The issue does not occur when the master is local, or when the default JavaSerializer is used.

      Reproduction:

      Run

      SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
      

      then run

      b = sc.broadcast("hello")
      sc.parallelize([0]).flatMap(lambda x: b.value).collect()
      

      This job fails because all tasks throw the following exception:

      14/12/28 14:26:08 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 8, localhost): java.lang.NullPointerException
      	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:589)
      	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:232)
      	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:228)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
      	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
      	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:228)
      	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
      	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
      	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1515)
      	at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:202)
      
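      For reference, the same failure can be reproduced from a standalone script rather than the shell. A minimal sketch, assuming a Spark 1.2.x installation (the app name is illustrative):

      from pyspark import SparkConf, SparkContext

      # Enable Kryo the same way spark-defaults.conf or --conf would.
      conf = (SparkConf()
              .setMaster("local-cluster[2,2,512]")
              .setAppName("kryo-broadcast-repro")
              .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
      sc = SparkContext(conf=conf)

      b = sc.broadcast("hello")
      # Each task dereferences the broadcast value; with Kryo enabled the
      # job fails with the NullPointerException shown above.
      print(sc.parallelize([0]).flatMap(lambda x: b.value).collect())
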

      Because KryoSerializer may be enabled globally in the spark-defaults.conf file, users can hit this error without having set the serializer in their own job, and be confused by it.
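
      For example, a single line like the following in conf/spark-defaults.conf (the default location in a standard Spark layout) is enough to put every PySpark job on this code path:

      spark.serializer    org.apache.spark.serializer.KryoSerializer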

      Workaround:

      Override the spark.serializer setting to use the default Java serializer.
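
      Concretely, the serializer can be forced back on the command line when launching the repro above; a sketch, using the standard JavaSerializer class:

      SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] \
        --conf spark.serializer=org.apache.spark.serializer.JavaSerializer

      The same key can also be set via SparkConf.set("spark.serializer", ...) in application code, or in spark-defaults.conf.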


    People

      Assignee: Josh Rosen (joshrosen)
      Reporter: Fi (coderfi)
      Votes: 0
      Watchers: 3
