Details
- Type: Bug
- Status: Open
- Priority: Blocker
- Resolution: Unresolved
- Affects Version/s: 3.3.2
- Fix Version/s: None
Description
In Spark SQL, comparing a string column to an integer literal can produce unexpected results: the string column is implicitly cast to int, and the query returns an empty result set.
case class Person(id: String, name: String)

import spark.implicits._ // needed for toDF(); pre-imported in spark-shell

val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
personDF.createOrReplaceTempView("person_ddf")

val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
val resultDF = spark.sql(sqlQuery)
resultDF.show() // Empty result due to type casting issue
Below are the parsed and analyzed logical plans that I am getting:
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT ('id = -1)
   +- 'UnresolvedRelation [person_ddf], [], false

== Analyzed Logical Plan ==
id: string, name: string
Project [id#356, name#357]
+- Filter NOT (cast(id#356 as int) = -1)
   +- SubqueryAlias person_ddf
      +- View (`person_ddf`, [id#356,name#357])
         +- LocalRelation [id#356, name#357]
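The empty result follows from that cast: CAST('a' AS INT) yields NULL, NULL = -1 evaluates to NULL, and Filter only keeps rows whose predicate is TRUE, so every row is dropped. A minimal sketch of the cast behavior (assuming the same spark session as above):

// Casting a non-numeric string to int yields NULL rather than failing,
// so the filter predicate is NULL for every row and nothing survives.
spark.sql("SELECT CAST('a' AS INT) AS casted").show() // prints null

Note that with spark.sql.ansi.enabled set to true (Spark 3.x), such an invalid cast raises a runtime error instead of silently returning NULL.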
But when I use the same query and table in Redshift, which is based on PostgreSQL, I get the desired result.
select * from person where id <> -1;
Explain plan obtained in Redshift:
XN Seq Scan on person (cost=0.00..0.03 rows=1 width=336)
Filter: ((id)::text <> '-1'::text)
In Spark's analyzed plan, the id column is cast to int, while in Redshift's plan both sides of the comparison are cast to text (varchar), so the comparison happens in the column's own type.
Shouldn't Spark SQL handle this the same way Redshift does, deriving the comparison type from the id column's data type rather than from the data type of the literal -1?
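For completeness, a workaround sketch that keeps the comparison in the string domain (assuming the same person_ddf view registered above):

// Quote the literal so both sides stay strings and the column is not cast:
spark.sql("SELECT * FROM person_ddf WHERE id <> '-1'").show()

// Or cast the literal explicitly instead of letting Spark pick int:
spark.sql("SELECT * FROM person_ddf WHERE id <> CAST(-1 AS STRING)").show()

Both variants return the two rows, matching the Redshift behavior.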