[SPARK-3572] Internal API for User-Defined Types - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: SQL
Labels: None
Target Version/s: 1.2.0

Description

If a user knows how to map a class to a struct type in Spark SQL, they should be able to register this mapping through sqlContext, so that Spark SQL can figure out the schema automatically.

trait RowSerializer[T] {
  def dataType: StructType
  def serialize(obj: T): Row
  def deserialize(row: Row): T
}

// Proposed method on sqlContext
def registerUserType[T](clazz: Class[T], serializer: Class[_ <: RowSerializer[T]]): Unit

In sqlContext, we can maintain a class-to-serializer map and use it for conversion. The serializer class can be embedded into the metadata, so when `select` is called, we know we want to deserialize the result.

sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer])
val points: RDD[LabeledPoint] = ...
val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) => v }
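
The class-to-serializer map described above could be sketched roughly as follows. This is only an illustration of the idea, not how Spark SQL implements it: `Row`, `StructField`, and `StructType` here are simplified stand-ins so the snippet runs without Spark, and `Point2D`, `Point2DRowSerializer`, `UserTypeRegistry`, `serializerFor`, and `UserTypeDemo` are hypothetical names introduced for the example. It also assumes every serializer class has a public no-argument constructor.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark SQL's Row and StructType (assumptions, not the real classes).
final case class Row(values: Any*)
final case class StructField(name: String, typeName: String)
final case class StructType(fields: Seq[StructField])

// Same shape as the trait proposed in the description.
trait RowSerializer[T] {
  def dataType: StructType
  def serialize(obj: T): Row
  def deserialize(row: Row): T
}

// Hypothetical user class and its serializer.
final case class Point2D(x: Double, y: Double)

class Point2DRowSerializer extends RowSerializer[Point2D] {
  val dataType: StructType =
    StructType(Seq(StructField("x", "double"), StructField("y", "double")))

  def serialize(p: Point2D): Row = Row(p.x, p.y)

  def deserialize(row: Row): Point2D = row.values match {
    case Seq(x: Double, y: Double) => Point2D(x, y)
    case other => throw new IllegalArgumentException(s"Unexpected row: $other")
  }
}

// Hypothetical registry holding the class-to-serializer map.
class UserTypeRegistry {
  private val serializers = mutable.Map.empty[Class[_], RowSerializer[_]]

  // Instantiates the serializer reflectively; requires a no-argument constructor.
  def registerUserType[T](clazz: Class[T], serializer: Class[_ <: RowSerializer[T]]): Unit =
    serializers(clazz) = serializer.getDeclaredConstructor().newInstance()

  // The cast is safe as long as entries are only added through registerUserType.
  def serializerFor[T](clazz: Class[T]): Option[RowSerializer[T]] =
    serializers.get(clazz).map(_.asInstanceOf[RowSerializer[T]])
}

object UserTypeDemo {
  // Registers the mapping, then round-trips one object through a Row.
  def run(): Point2D = {
    val registry = new UserTypeRegistry
    registry.registerUserType(classOf[Point2D], classOf[Point2DRowSerializer])
    val serializer = registry.serializerFor(classOf[Point2D]).get
    serializer.deserialize(serializer.serialize(Point2D(1.0, 2.0)))
  }
}
```

Keying the map on `Class[_]` keeps lookup simple, at the cost of an unchecked cast in `serializerFor`; the typed `registerUserType` signature is what keeps that cast sound.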