Description
Cache a SQL UNION whose two sides have different column data types:
scala> spark.sql("select 1 id union select 's2' id").cache()
Dataset.union on top of the same query does not leverage the cache; the optimized plan contains no `InMemoryRelation`:
scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- Aggregate [id#109], [id#109]
:  +- Union false, false
:     :- Project [1 AS id#109]
:     :  +- OneRowRelation
:     +- Project [s2 AS id#108]
:        +- OneRowRelation
+- Project [s3 AS s3#111]
   +- OneRowRelation
A SQL UNION on top of the cached SQL UNION does use the cache. Note the `InMemoryRelation` in the optimized plan:
scala> spark.sql("(select 1 id union select 's2' id) union select 's3'").queryExecution.optimizedPlan
res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Aggregate [id#117], [id#117]
+- Union false, false
   :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 replicas)
   :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
   :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, [plan_id=241]
   :           +- *(3) HashAggregate(keys=[id#100], functions=[], output=[id#100])
   :              +- Union
   :                 :- *(1) Project [1 AS id#100]
   :                 :  +- *(1) Scan OneRowRelation[]
   :                 +- *(2) Project [s2 AS id#99]
   :                    +- *(2) Scan OneRowRelation[]
   +- Project [s3 AS s3#116]
      +- OneRowRelation
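The difference between the two plans can also be checked programmatically instead of by reading the tree. The following is a minimal sketch that can be pasted into spark-shell: `usesCache` is a hypothetical helper (not part of Spark) that reports whether an optimized plan contains an `InMemoryRelation` node. Going by the plans in this report, it would be expected to return false for the Dataset.union variant and true for the SQL UNION variant, though the result may differ on other Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// Hypothetical helper (an assumption for illustration, not a Spark API):
// true if the optimized logical plan contains at least one InMemoryRelation,
// i.e. some part of the query is read from the cache.
def usesCache(df: DataFrame): Boolean =
  df.queryExecution.optimizedPlan.collect { case r: InMemoryRelation => r }.nonEmpty

// In spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("union-cache-check").getOrCreate()

// Cache the SQL UNION whose sides have different column types.
spark.sql("select 1 id union select 's2' id").cache()

// The two ways of adding a third branch, as in the report.
val viaDataset = spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'"))
val viaSql = spark.sql("(select 1 id union select 's2' id) union select 's3'")

println(s"Dataset.union uses cache: ${usesCache(viaDataset)}")
println(s"SQL UNION uses cache:     ${usesCache(viaSql)}")
```

Checking for the node type rather than comparing plan strings keeps the check independent of expression IDs such as `id#117`, which change between runs.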