Description
When KryoSerializer is used, PySpark throws a NullPointerException when trying to send broadcast variables to workers. The issue does not occur when the master is local, or when using the default JavaSerializer.
Reproduction:
Run
SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
then run
b = sc.broadcast("hello")
sc.parallelize([0]).flatMap(lambda x: b.value).collect()
This job fails because all tasks throw the following exception:
14/12/28 14:26:08 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 8, localhost): java.lang.NullPointerException
    at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:589)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:232)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:228)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:228)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1515)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:202)
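The NPE originates in PythonRDD.writeUTF, which is consistent with a null String reaching the executor's writer thread, e.g. if the broadcast's serialized state (such as the path of its backing file) comes back as null after Kryo deserialization. As a rough illustration of that failure shape only (a reconstruction for this report, not Spark's actual source; the object and method here are hypothetical):

import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object WriteUtfNpeSketch {
  // Hypothetical reconstruction of a writeUTF-style helper:
  // encode the string as UTF-8, then write it length-prefixed.
  def writeUTF(str: String, dataOut: DataOutputStream): Unit = {
    val bytes = str.getBytes(UTF_8) // throws NullPointerException when str is null
    dataOut.writeInt(bytes.length)
    dataOut.write(bytes)
  }

  def main(args: Array[String]): Unit = {
    val out = new DataOutputStream(new ByteArrayOutputStream())
    writeUTF(null, out) // reproduces the java.lang.NullPointerException seen above
  }
}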
Because KryoSerializer may be enabled in the spark-defaults.conf file, users may hit this error even though they never set the serializer explicitly, which is confusing.
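For example, Kryo is commonly enabled with a line like the following in conf/spark-defaults.conf (standard whitespace-separated key/value format), which then applies to every application launched with that configuration:

spark.serializer    org.apache.spark.serializer.KryoSerializer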
Workaround:
Override the spark.serializer setting to use the default Java serializer.
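For example, passing the default serializer explicitly on the command line (mirroring the launch command above) overrides whatever spark-defaults.conf sets:

SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] --conf spark.serializer=org.apache.spark.serializer.JavaSerializer

With JavaSerializer in effect, the broadcast example above runs without the NullPointerException.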
Issue Links
duplicates: SPARK-5779 "Python broadcast does not work with Kryo serializer" (Closed)