test("Pushdown demo") {
withTable("t") {
withTempDir { loc =>
sql(
s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
| LOCATION '${loc.getAbsolutePath}'
|""".stripMargin)
sql(
"""
|SELECT c1,
|case
| when get_json_object(s,'$.a')=1 then "a"
| when get_json_object(s,'$.a')=2 then "b"
|end as s_type
|FROM t
|WHERE get_json_object(s,'$.a') in (1, 2)
""".stripMargin).explain(true)
}
}
}
will got plan as
== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
+- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)
we can see that get_json_object(s#2, $.a) will be computed tree times
Always there are expensive expressions are re-computed many times in such grammar。