Details
- Type: Bug
- Status: Open
- Priority: Blocker
- Resolution: Unresolved
- Affects Version/s: 3.3.2
- Fix Version/s: None
Description
In Spark SQL, comparing a string column to an integer literal can produce unexpected results: the string column is implicitly cast to int, and the query returns an empty result set.
case class Person(id: String, name: String)

import spark.implicits._ // needed for toDF(); pre-imported in spark-shell

val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
personDF.createOrReplaceTempView("person_ddf")

val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
val resultDF = spark.sql(sqlQuery)
resultDF.show() // Empty result due to type casting issue
Below are the parsed and analyzed logical plans that I am getting:
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT ('id = -1)
   +- 'UnresolvedRelation [person_ddf], [], false

== Analyzed Logical Plan ==
id: string, name: string
Project [id#356, name#357]
+- Filter NOT (cast(id#356 as int) = -1)
   +- SubqueryAlias person_ddf
      +- View (`person_ddf`, [id#356,name#357])
         +- LocalRelation [id#356, name#357]
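The empty result follows from that cast: CAST('a' AS INT) yields NULL, NULL = -1 evaluates to NULL, and Filter only keeps rows whose predicate is TRUE, so every row is dropped. A minimal sketch of the cast behavior (assuming the same spark session as above):

// Casting a non-numeric string to int yields NULL rather than failing,
// so the filter predicate is NULL for every row and nothing survives.
spark.sql("SELECT CAST('a' AS INT) AS casted").show() // prints null

Note that with spark.sql.ansi.enabled set to true (Spark 3.x), such an invalid cast raises a runtime error instead of silently returning NULL.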
But when I use the same query and table in Redshift, which is based on PostgreSQL, I get the desired result.
select * from person where id <> -1;
Explain plan obtained in Redshift:
XN Seq Scan on person (cost=0.00..0.03 rows=1 width=336)
Filter: ((id)::text <> '-1'::text)
In Spark's analyzed plan, the id column is cast to int, while in Redshift's plan both sides of the comparison are cast to text (varchar), so the comparison happens in the column's own type.
Shouldn't Spark SQL handle this the same way Redshift does, deriving the comparison type from the id column's data type rather than from the data type of the literal -1?
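For completeness, a workaround sketch that keeps the comparison in the string domain (assuming the same person_ddf view registered above):

// Quote the literal so both sides stay strings and the column is not cast:
spark.sql("SELECT * FROM person_ddf WHERE id <> '-1'").show()

// Or cast the literal explicitly instead of letting Spark pick int:
spark.sql("SELECT * FROM person_ddf WHERE id <> CAST(-1 AS STRING)").show()

Both variants return the two rows, matching the Redshift behavior.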