Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Incomplete
- 1.5.0, 1.5.1
- None
Description
Currently, the classes and constructors of all the primitive DataTypes (of StructFields) are private:
https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types
This means that even for simple String-based UDTs, users always have to implement serialize() and deserialize(). UDTs for something as simple as a Northwind database (products, orders, customers) would be very useful for pattern matching and validation. For example:
import org.apache.spark.sql.types._
import scala.util.matching.Regex

@SQLUserDefinedType(udt = classOf[ProductNameUDT])
case class ProductName(name: String) extends StringType with Validator {
  private val pattern: Regex = """[A-Z][A-Za-z]*""".r

  def validate(): Boolean = {
    // Assumed reconstruction: valid when the whole name matches the pattern
    name match {
      case pattern() => true
      case _ => false
    }
  }
}

class ProductNameUDT extends UserDefinedType[ProductName] {
  // No need for this; ProductName is a StringType so we know how to serialize
  override def serialize(p: Any): Any = {
    p match { ... } // cases elided
  }

  // Not sure why this override is needed at all; can't we always get this simply by the UDT type param?
  override def userClass: Class[ProductName] = classOf[ProductName]

  // Instead of the below, just infer the StructField name via reflection of the wrapper class' name
  override def sqlType: DataType = StructType(Seq(StructField("ProductName", StringType)))

  // Still needed.
  override def deserialize(datum: Any): ProductName = {
    datum match { ... } // cases elided
  }
}
This would simplify the process of creating "primitive extension" UDTs down to just 2 steps:
1. An annotated case class that extends a primitive DataType
2. The UDT itself, which only needs a deserializer
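The two steps above can be sketched in plain Scala, independent of Spark. This is only an illustration of the idea, not Spark's API: `Validator`, `StringWrapperCodec`, and `ProductNameCodec` are hypothetical names invented here, and the wrapper does not actually extend `StringType`, since that is exactly what the private constructors currently prevent.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for the Validator mixin used in the proposal; not part of Spark.
trait Validator {
  def validate(): Boolean
}

// Step 1: a case class wrapping a String, with validation logic attached.
final case class ProductName(name: String) extends Validator {
  private val pattern: Regex = """[A-Z][A-Za-z]*""".r

  // Valid only when the entire name matches the pattern.
  def validate(): Boolean = pattern.pattern.matcher(name).matches()
}

// Hypothetical base class: serialization of a String wrapper is derived once
// from an unwrap function, so subclasses only supply deserialize.
abstract class StringWrapperCodec[T](unwrap: T => String) {
  def serialize(value: T): String = unwrap(value)
  def deserialize(datum: String): T
}

// Step 2: the per-type codec only needs a deserializer.
object ProductNameCodec extends StringWrapperCodec[ProductName](_.name) {
  def deserialize(datum: String): ProductName = ProductName(datum)
}
```

With this shape, `ProductNameCodec.deserialize(ProductNameCodec.serialize(p))` returns a value equal to `p`, and `validate()` can be called on the result to reject names that do not match the pattern.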
Issue Links
- is part of: SPARK-11010 Fixes and enhancements addressing UDTs' api and several usability concerns (Resolved)
- is related to: SPARK-14487 User Defined Type registration without SQLUserDefinedType annotation (Resolved)