SPARK-12823: Cannot create UDF with StructType input


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Problem

      It is not possible to apply a UDF to a column that has a struct data type. Two previous requests to the mailing list remained unanswered.

      How to Reproduce

      // Run in spark-shell (Spark 1.5.2), where sc is provided.
      val sql = new org.apache.spark.sql.SQLContext(sc)
      import sql.implicits._

      // The nested case class yields a column of struct type.
      case class KV(key: Long, value: String)
      case class Row(kv: KV)
      val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

      // Attempt 1: a UDF over the case class compiles but fails at runtime,
      // because the struct is passed in as a GenericRowWithSchema, not a KV.
      val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
      df.select(udf1(df("kv"))).show
      // java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line78.$read$$iwC$$iwC$KV

      // Attempt 2: a UDF over a tuple fails analysis, because the expected
      // field names (_1, _2) do not match the struct's (key, value).
      val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
      df.select(udf2(df("kv"))).show
      // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 'kv' is of struct<key:bigint,value:string> type.;
      
      Mailing List Entries

      Possible Workaround

      If you create a UserDefinedFunction manually instead of using the udf helper function, it works. See https://github.com/FRosner/struct-udf, which exposes the UserDefinedFunction constructor (changing it from package private to public). However, the function then has to work with a Row, because the row is not automatically converted to a case class or tuple.
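
      For illustration, a minimal sketch of that workaround. It assumes the constructor has been made accessible (which is what the linked repo arranges) and that it has the Spark 1.5 shape (f: AnyRef, dataType: DataType, inputTypes: Seq[DataType]):

      // Alias the SQL Row type, because the repro above shadows the name Row.
      import org.apache.spark.sql.{Row => SqlRow, UserDefinedFunction}
      import org.apache.spark.sql.types.StringType

      // The struct column reaches the function as a generic Row carrying a
      // schema, so the field is read by name (getString(1) would also work).
      val structUdf = UserDefinedFunction((row: SqlRow) => row.getAs[String]("value"), StringType, Nil)
      df.select(structUdf(df("kv"))).show
      // Expected to return the values "a" and "b".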

      People

    • Assignee: Unassigned
    • Reporter: Frank Rosner (frosner)
    • Votes: 10
    • Watchers: 23
