Spark / SPARK-20212

UDFs with Option[Primitive Type] don't work as expected


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      The documentation for ScalaUDF says:

      Note that if you use primitive parameters, you are not able to check if it is
      null or not, and the UDF will return null for you if the primitive input is
      null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
      

      This works with boxed types:

      import org.apache.spark.sql.functions.{col, udf}
      import spark.implicits._
      
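      // A boxed java.lang.Long parameter can be null, so the UDF can test for it.
      def is_null_box(x: java.lang.Long): String = {
          x match {
              case _: java.lang.Long => "Yep"    // non-null value
              case null => "No man"              // null input reaches the UDF
          }
      }
      val is_null_box_udf = udf(is_null_box _)
      // Five boxed values followed by two nulls.
      val sample = (1L to 5L).toList.map(x => java.lang.Long.valueOf(x)) ++ List[java.lang.Long](null, null)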
      val df = sample.toDF("col1")
      df.select(is_null_box_udf(col("col1"))).show(10)
      

      But it does not work as expected with Option[Long]:

      import org.apache.spark.sql.functions.{col, udf}
      import spark.implicits._
      
      // Intended behavior: receive Option[Long] so that a null input arrives as None.
      def is_null_opt(x: Option[Long]): String = {
          x match {
              case Some(_: Long) => "Yep"
              case None => "No man"
          }
      }
      val is_null_opt_udf = udf(is_null_opt _)
      val sample = (1L to 5L)
      // This does not help: val sample = (1L to 5L).map(Some(_)).toList
      val df = sample.toDF("col1")
      df.select(is_null_opt_udf(col("col1"))).show(10)
      

      That throws:

      Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Option
        at $anonfun$1.apply(<console>:56)
        at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
        at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
        at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
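
One plausible reading of the trace, which the report itself does not confirm, is that the function is invoked through its erased type and handed the raw boxed value rather than an Option. The sketch below reproduces the same kind of failure in plain Scala, without Spark. The object name ErasedCallSketch and the helper callWithRawValue are hypothetical and exist only for illustration.

```scala
object ErasedCallSketch {
  // Same shape as the UDF body in the report.
  def isNullOpt(x: Option[Long]): String = x match {
    case Some(_) => "Yep"
    case None    => "No man"
  }

  // Simulates a caller that only sees the function as Any => String
  // and passes whatever value it has, without wrapping it in an Option.
  def callWithRawValue(value: Any): Either[String, String] = {
    val erased = (isNullOpt _).asInstanceOf[Any => String]
    try Right(erased(value))
    catch { case _: ClassCastException => Left("ClassCastException") }
  }
}
```

Passing a java.lang.Long to callWithRawValue ends in the Left branch, while passing Some(1L) reaches the function body normally.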
      

      The current workaround is to use boxed types, but that makes the code look awkward.
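
A minimal sketch of one way to keep the Option-based logic while still exposing a boxed parameter, assuming the boxed workaround described above. The object name OptionUdfSketch and the helper liftBoxed are hypothetical and are not part of Spark.

```scala
object OptionUdfSketch {
  // Hypothetical adapter: turns Option-based logic into a function that
  // takes a boxed java.lang.Long, mapping null to None.
  def liftBoxed[R](f: Option[Long] => R): java.lang.Long => R =
    (x: java.lang.Long) => f(Option(x).map(_.longValue))

  // The null handling is written once, against Option[Long].
  def isNullOpt(x: Option[Long]): String = x match {
    case Some(_) => "Yep"
    case None    => "No man"
  }

  // Boxed-parameter version of the same logic.
  val isNullBoxed: java.lang.Long => String = liftBoxed(isNullOpt)
}
```

In a Spark session, OptionUdfSketch.isNullBoxed could then be passed to udf(...) in place of is_null_box _, so the boxing stays confined to the adapter.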

      If you just use Long instead of a boxed type, the code may break in subtle ways: it does not fail, it returns null. That is documented but easy to miss. It is not part of this bug, but someone who "corrects" a boxed function to use a primitive type might get surprising results:

      import org.apache.spark.sql.functions.{col, udf, expr}
      import spark.implicits._
      
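      // Primitive Long parameter: Option(x) can never be None here, and per the
      // ScalaUDF note above Spark returns null when the input is null.
      def is_null_opt(x: Long): String = {
          Option(x) match {
              case Some(_: Long) => "Yep"
              case None => "No man"
          }
      }
      val is_null_opt_udf = udf(is_null_opt _)
      val sample = (1L to 5L)
      // Replace the value 2 with NULL to get a nullable column.
      val df = sample.toDF("col3").select(expr("CASE WHEN col3=2 THEN NULL ELSE col3 END").alias("col3"))
      df.printSchema()
      df.select(is_null_opt_udf(col("col3"))).show(10)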
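
The reason the None branch cannot help with a primitive parameter can be seen in plain Scala, without Spark: a primitive Long is never null, so Option(x) always yields Some. The sketch below contrasts this with a boxed parameter. The object and method names are hypothetical.

```scala
object PrimitiveOptionSketch {
  // Primitive parameter: x is boxed on the way into Option.apply and is
  // never null, so the None branch is unreachable.
  def describePrimitive(x: Long): String = Option(x) match {
    case Some(_) => "Yep"
    case None    => "No man"
  }

  // Boxed parameter: null is a legal argument, and Option(null) is None.
  def describeBoxed(x: java.lang.Long): String = Option(x) match {
    case Some(_) => "Yep"
    case None    => "No man"
  }
}
```

Within a UDF, the null row therefore never reaches the function body in the primitive case. According to the ScalaUDF note quoted earlier, Spark returns null for that row instead.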
      

            People

              Assignee: Unassigned
              Reporter: Marius Feteanu (marius_feteanu)