Description
We use SparkSQL and Catalyst to optimize our Spark jobs. We have read the source code and tested the SimplifyCasts rule, which works for simple SQL without nested casts.
For example, the query "select cast(string_date as string) from t1" is optimized: the redundant cast is removed from the plan.
== Analyzed Logical Plan ==
string_date: string
Project [cast(string_date#12 as string) AS string_date#24]
+- SubqueryAlias t1
   +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false
== Optimized Logical Plan ==
Project [string_date#12]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false
However, it fails to optimize a nested cast such as "select cast(cast(string_date as string) as string) from t1".
== Analyzed Logical Plan ==
CAST(CAST(string_date AS STRING) AS STRING): string
Project [cast(cast(string_date#12 as string) as string) AS CAST(CAST(string_date AS STRING) AS STRING)#24]
+- SubqueryAlias t1
   +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false
== Optimized Logical Plan ==
Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false
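
To make the expected behavior concrete, here is a minimal sketch of what we would expect from the rule for both the simple and the nested case. It is a toy model in plain Python, not Catalyst's actual code: the `Attribute` and `Cast` classes and the `simplify_casts` function are hypothetical stand-ins, and the only assumption carried over from SimplifyCasts is that a cast is redundant when its child already has the target type. Simplifying the inner expression first is what lets the nested case collapse completely.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Attribute:
    """Toy stand-in for a column reference with a known data type."""
    name: str
    data_type: str


@dataclass(frozen=True)
class Cast:
    """Toy stand-in for cast(child as data_type)."""
    child: "Expr"
    data_type: str


Expr = Union[Attribute, Cast]


def simplify_casts(expr: Expr) -> Expr:
    """Drop every cast whose child already has the target type (bottom-up)."""
    if isinstance(expr, Cast):
        # Simplify the inner expression first so nested casts are handled.
        child = simplify_casts(expr.child)
        if child.data_type == expr.data_type:
            return child  # redundant cast: return the child unchanged
        return Cast(child, expr.data_type)
    return expr


col = Attribute("string_date", "string")

# cast(string_date as string) -> string_date
assert simplify_casts(Cast(col, "string")) == col

# cast(cast(string_date as string) as string) -> string_date
assert simplify_casts(Cast(Cast(col, "string"), "string")) == col

# A cast that actually changes the type must be kept.
assert simplify_casts(Cast(col, "date")) == Cast(col, "date")
```

In this model both queries from the description reduce to the bare string_date column, which is the result we would expect from the optimizer as well.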