Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: 2.1.0
- Fix Version/s: None
- Component/s: None
Description
The documentation for ScalaUDF says:
Note that if you use primitive parameters, you are not able to check if it is null or not, and the UDF will return null for you if the primitive input is null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
This works with boxed types:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

def is_null_box(x: java.lang.Long): String = {
  x match {
    case _: java.lang.Long => "Yep"
    case null => "No man"
  }
}

val is_null_box_udf = udf(is_null_box _)
val sample = (1L to 5L).toList.map(x => new java.lang.Long(x)) ++ List[java.lang.Long](null, null)
val df = sample.toDF("col1")
df.select(is_null_box_udf(col("col1"))).show(10)
But it does not work with Option[Long] as expected:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

def is_null_opt(x: Option[Long]): String = {
  x match {
    case Some(_: Long) => "Yep"
    case None => "No man"
  }
}

val is_null_opt_udf = udf(is_null_opt _)
val sample = (1L to 5L)
// This does not help: val sample = (1L to 5L).map(Some(_)).toList
val df = sample.toDF("col1")
df.select(is_null_opt_udf(col("col1"))).show(10)
That throws:
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Option
  at $anonfun$1.apply(<console>:56)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
The current workaround is to use boxed types, but it makes the code awkward.
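One way to keep the Option-based matching while satisfying the boxed-type requirement is to accept java.lang.Long at the UDF boundary and lift it to Option inside. Below is a minimal sketch of that adapter idea in plain Scala (no Spark session needed); the name is_null_adapted is hypothetical, while is_null_opt is the function from this report:

```scala
// Option-based logic from the report, unchanged.
def is_null_opt(x: Option[Long]): String = x match {
  case Some(_) => "Yep"
  case None    => "No man"
}

// Hypothetical adapter: accept the boxed type that Spark actually passes,
// then lift it to Option. Option(null) is None; Option of a value is Some.
def is_null_adapted(x: java.lang.Long): String =
  is_null_opt(Option(x).map(_.longValue))

// In Spark, the adapter (not the Option version) would be registered:
//   val is_null_opt_udf = udf(is_null_adapted _)
println(is_null_adapted(java.lang.Long.valueOf(5L))) // Yep
println(is_null_adapted(null))                       // No man
```

This keeps the null-handling logic in one Option-based function and confines the boxed type to the thin wrapper that Spark sees.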
If you just use a plain Long instead of a boxed type, the code may break in subtle ways: it does not fail, it silently returns null. That behavior is documented but easy to miss. It is not part of this bug, but if someone "corrects" a boxed-type function to use primitive parameters, they may get surprising results:
import org.apache.spark.sql.functions.{col, udf, expr}
import spark.implicits._

def is_null_opt(x: Long): String = {
  Option(x) match {
    case Some(_: Long) => "Yep"
    case None => "No man"
  }
}

val is_null_opt_udf = udf(is_null_opt _)
val sample = (1L to 5L)
val df = sample.toDF("col3")
  .select(expr("CASE WHEN col3=2 THEN NULL ELSE col3 END").alias("col3"))
df.printSchema
df.select(is_null_opt_udf(col("col3"))).show(10)
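The surprise above comes down to the fact that a primitive Long parameter can never hold null, so Option(x) on it is always Some and the None branch is dead code; per the quoted documentation, Spark returns null for the row instead of calling the function. A standalone sketch of the dead branch (plain Scala, no Spark needed, using the primitive signature from the report):

```scala
// Primitive-parameter version from the report: Option(x) autoboxes x,
// and a boxed value produced from a primitive is never null.
def is_null_opt(x: Long): String = Option(x) match {
  case Some(_) => "Yep"
  case None    => "No man" // unreachable for a primitive parameter
}

// Every primitive value, including 0L, lands in the Some branch;
// a SQL NULL never reaches the function at all.
println(is_null_opt(0L)) // Yep
println(is_null_opt(5L)) // Yep
```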
Issue Links
- duplicates
  - SPARK-12648 UDF with Option[Double] throws ClassCastException (Resolved)
- is related to
  - SPARK-11725 Let UDF to handle null value (Resolved)