Description
As a Spark user who uses value classes in Scala for modelling domain objects, I would also like to make use of them in Datasets.
For example, I would like to use the User case class, which uses a value class for its id, as the type of a Dataset:
- the underlying primitive should be mapped to the value-class column
- functions on the column (for example, comparisons) should only work if defined on the value class, and should use those implementations
- `show()` should pick up the `toString` method of the value class
case class Id(value: Long) extends AnyVal {
  override def toString: String = value.toHexString
}
case class User(id: Id, name: String)

import spark.implicits._

val ds = spark.sparkContext
  .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS()
  .withColumnRenamed("_1", "id")
  .withColumnRenamed("_2", "name")

// mapping should work
val usrs = ds.as[User]

// show should use toString
usrs.show()

// comparison with Long should throw an exception, as it is not defined on Id
usrs.col("id") > 0L
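For context, here is a minimal sketch probing what encoder derivation does for User today (assuming derivation via `Encoders.product`; the exact behavior varies by Spark version, see the linked issues below):

import org.apache.spark.sql.{Encoder, Encoders}

// Probe how the value class is encoded today. Depending on the Spark
// version, derivation may break at runtime (see SPARK-17368) or flatten
// Id to its underlying Long, so Id's methods are never consulted.
val userEncoder: Encoder[User] = Encoders.product[User]
println(userEncoder.schema.treeString)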
For example, `.show()` should use the `toString` of the `Id` value class:
+---+-------+
| id|   name|
+---+-------+
|  0| name-0|
|  1| name-1|
|  2| name-2|
|  3| name-3|
|  4| name-4|
|  5| name-5|
|  6| name-6|
|  7| name-7|
|  8| name-8|
|  9| name-9|
|  a|name-10|
|  b|name-11|
|  c|name-12|
+---+-------+
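Until value classes are supported, the display aspect alone can be approximated with the built-in `hex` and `lower` functions. This is only a sketch that mimics `Id.toString`; it gives none of the typed column semantics requested above:

import org.apache.spark.sql.functions.{col, hex, lower}

// Workaround sketch for the display aspect only: render the primitive
// id column as lowercase hex, mimicking Long.toHexString.
ds.withColumn("id", lower(hex(col("id")))).show()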
Issue Links
- is related to:
  - SPARK-17368 Scala value classes create encoder problems and break at runtime (Resolved)
  - SPARK-19741 ClassCastException when using Dataset with type containing value types (Resolved)