Description
Applying an identity map transformation (`df.map(x => x)`) to a statically typed Dataset backed by a bean encoder for a POJO produces an unexpected result.
Given POJOs:
```java
public class Stuff implements Serializable {
    private String name;

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
}

public class Outer implements Serializable {
    private String name;
    private Stuff stuff;

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }

    public void setStuff(Stuff stuff) { this.stuff = stuff; }
    public Stuff getStuff() { return stuff; }
}
```
Produces the result:
```scala
scala> val encoder = Encoders.bean(classOf[Outer])
encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct<name:string>]

scala> val schema = encoder.schema
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 |    |-- name: string (nullable = true)

scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct<name: string>]

scala> df.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+

scala> df.map(x => x)(encoder).show()
+----+------+
|name| stuff|
+----+------+
|  v1|[null]|
+----+------+
```
After the identity transformation, `stuff` becomes a struct whose fields are all null instead of remaining null itself.
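The difference is also visible at the SQL level. As a quick check (a sketch against the Dataset built above; the `stuffIsNull`/`nameIsNull` aliases are just illustrative):

```scala
// Before the identity map: the struct itself is NULL.
df.selectExpr("stuff IS NULL AS stuffIsNull").show()

// After the round trip: per the output above, the struct should no longer
// be NULL, while its name field is.
df.map(x => x)(encoder)
  .selectExpr("stuff IS NULL AS stuffIsNull", "stuff.name IS NULL AS nameIsNull")
  .show()
```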
Doing the same with case classes preserves the nulls:
```scala
scala> case class ScalaStuff(name: String)
defined class ScalaStuff

scala> case class ScalaOuter(name: String, stuff: ScalaStuff)
defined class ScalaOuter

scala> val encoder2 = Encoders.product[ScalaOuter]
encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, stuff[0]: struct<name:string>]

scala> val schema2 = encoder2.schema
schema2: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema2.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 |    |-- name: string (nullable = true)

scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter]
df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: struct<name: string>]

scala> df2.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+

scala> df2.map(x => x).show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+
```
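The `[null]` output suggests that the bean deserializer materializes an empty `Stuff` instance rather than leaving the reference null. If so, a possible interim workaround (a sketch, not a verified fix; it assumes a `Stuff` with a null `name` can only come from an originally-null `stuff`, and that the serializer handles a null bean correctly) is to null the bean back out inside the map:

```scala
df.map({ x =>
  // Collapse a fully-null nested bean back to a null reference.
  val s = x.getStuff
  if (s != null && s.getName == null) {
    x.setStuff(null)
  }
  x
})(encoder).show()
```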
stuff.json:
{"name":"v1", "stuff":null }