Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-16954

UDFs should allow output type to be specified in terms of the input type



    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • None
    • None
    • SQL


      Consider an array_compact UDF that removes null values from an array. There is no easy way to implement this UDF because an explicit return type with a TypeTag is required by code generation.

      The interesting observation here is that the output type of `array_compact` is the same as its input type. In general, there is a broad class of UDFs, especially collection-oriented ones, whose output types are functions of the input types. In our Spark work we have found collection manipulation UDFs to be very powerful for cleaning up data and substantially improving performance, in particular, avoiding explode followed by groupBy. It would be nice if Spark made adding these types of UDFs very easy.

      I won't go into possible ways to implement this under the covers as there are many options but I do want to point out that it is possible to communicate the right type information to Spark without changing the signature for UDF registration using placeholder types, e.g.,

      sealed trait UDFArgumentAtPosition
      case class ArgPos1 extends UDFArgumentAtPosition
      case class ArgPos2 extends UDFArgumentAtPosition
      // ...
      case class Struct[ArgPos <: UDFArgumentAtPosition](value: Row)
      case class ArrayElement[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
      case class MapKey[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
      case class MapValue[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
      // Functions are stubbed just to show compilation succeeds
      def arrayCompact[A : TypeTag](xs: Seq[A]): ArgPos1 = null
      def arraySum[A : Numeric : TypeTag](xs: Seq[A]): ArrayElement[ArgPos1, A] = ArrayElement(implicitly[Numeric[A]].zero) 




            Unassigned Unassigned
            simeons Simeon Simeonov
            0 Vote for this issue
            1 Start watching this issue