Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
1.5.2
-
None
Description
Problem
It is not possible to apply a UDF to a column that has a struct data type. Two previous requests to the mailing list remained unanswered.
How-To-Reproduce
val sql = new org.apache.spark.sql.SQLContext(sc) import sql.implicits._ case class KV(key: Long, value: String) case class Row(kv: KV) val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value) df.select(udf1(df("kv"))).show // java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line78.$read$$iwC$$iwC$KV val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2) df.select(udf2(df("kv"))).show // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 'kv' is of struct<key:bigint,value:string> type.;
Mailing List Entries
Possible Workaround
If you create a UserDefinedFunction manually, not using the udf helper functions, it works. See https://github.com/FRosner/struct-udf, which exposes the UserDefinedFunction constructor (public from package private). However, then you have to work with a Row, because it does not automatically convert the row to a case class / tuple.
Attachments
Issue Links
- is duplicated by
-
SPARK-12809 Spark SQL UDF does not work with struct input parameters
- Resolved
- relates to
-
SPARK-12809 Spark SQL UDF does not work with struct input parameters
- Resolved
-
SPARK-18884 Support Array[_] in ScalaUDF
- Resolved