  1. Spark
  2. SPARK-11003

Allowing UserDefinedTypes to extend primitives

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 1.5.0, 1.5.1
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Currently, the classes and constructors of all the primitive DataTypes (of StructFields) are private:

      https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types

      This means that even for simple String-based UDTs, users always have to implement serialize() and deserialize(). UDTs for something as simple as a Northwind database (products, orders, customers) would be very useful for pattern matching and validation. For example:

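      import org.apache.spark.sql.types._
      import scala.util.matching.Regex

      // Proposed usage: does not compile today because StringType's constructor is private.
      // Validator is assumed to be a user-defined trait that declares validate().
      @SQLUserDefinedType(udt = classOf[ProductNameUDT])
      case class ProductName(name: String) extends StringType with Validator {
        // Must be a Regex (note the .r) to be usable as an extractor below
        private val pattern: Regex = """[A-Z][A-Za-z]*""".r

        def validate(): Boolean = name match {
          case pattern(_*) => true
          case _ => false
        }
      }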

      class ProductNameUDT extends UserDefinedType[ProductName] {
        // No need for this; ProductName is a StringType so we know how to serialize
        override def serialize(p: Any): Any = p match {
          case p: ProductName => Seq(p.name)
        }

        // Not sure why this override is needed at all; can't we always get this from the UDT's type parameter?
        override def userClass: Class[ProductName] = classOf[ProductName]

        // Instead of the below, just infer the StructField name via reflection on the wrapper class's name
        override def sqlType: DataType = StructType(Seq(StructField("ProductName", StringType)))

        // Still needed.
        override def deserialize(datum: Any): ProductName = datum match {
          case values: Seq[_] =>
            assert(values.length == 1)
            ProductName(values.head.asInstanceOf[String])
        }
      }

      This would reduce creating a "primitive extension" UDT to just two steps:
      1. An annotated case class that extends a primitive DataType
      2. A UDT that only needs to implement a deserializer
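
The two steps above can be sketched without touching Spark at all. The following is only an illustration of the idea, not an existing or planned Spark API: `StringBacked`, `StringBackedUDT`, and `fieldName` are hypothetical names invented for this sketch, and `Validator` is a stand-in for the trait the example above assumes. The sketch shows how a base class could obtain `userClass` from its type parameter (via a `ClassTag`), derive the field name from the wrapper class's name, and supply a default serializer, leaving only `deserialize` to the concrete UDT.

```scala
import scala.reflect.ClassTag
import scala.util.matching.Regex

// Hypothetical stand-in for the Validator trait assumed by the issue's example.
trait Validator {
  def validate(): Boolean
}

// Hypothetical marker for a wrapper around a single String value. It plays the
// role that "extends StringType" plays in the proposal.
trait StringBacked {
  def value: String
}

// Hypothetical base class (not part of Spark). Everything except deserialize()
// is derived from the type parameter or from the wrapped String.
abstract class StringBackedUDT[T <: StringBacked](implicit tag: ClassTag[T]) {
  // Recovered from the type parameter instead of being overridden by hand.
  def userClass: Class[T] = tag.runtimeClass.asInstanceOf[Class[T]]

  // Inferred by reflection on the wrapper class's name.
  def fieldName: String = userClass.getSimpleName

  // Known up front because the wrapper holds exactly one String.
  def serialize(obj: T): String = obj.value

  // The only method a concrete UDT still has to provide.
  def deserialize(datum: String): T
}

// Step 1: the case class wrapping a String, with validation.
case class ProductName(name: String) extends StringBacked with Validator {
  def value: String = name

  def validate(): Boolean = name match {
    case ProductName.NamePattern(_*) => true
    case _ => false
  }
}

object ProductName {
  // A Regex (note the .r) so it can be used as an extractor in validate().
  private val NamePattern: Regex = """[A-Z][A-Za-z]*""".r
}

// Step 2: the UDT only supplies a deserializer.
class ProductNameUDT extends StringBackedUDT[ProductName] {
  def deserialize(datum: String): ProductName = ProductName(datum)
}

object Demo {
  def main(args: Array[String]): Unit = {
    val udt = new ProductNameUDT
    val roundTripped = udt.deserialize(udt.serialize(ProductName("Chai")))
    println(s"${udt.fieldName}: $roundTripped, valid = ${roundTripped.validate()}")
  }
}
```

In this sketch the `ClassTag` context is what answers the question in the comment on `userClass`: the compiler fills it in at the point where `ProductNameUDT` names its type argument, so the subclass never has to repeat `classOf[ProductName]`. Whether Spark's own `UserDefinedType` could adopt the same approach is a separate question that this sketch does not settle.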

            People

              Assignee: Unassigned
              Reporter: John Muller (blue666man)
              Votes: 1
              Watchers: 3
