Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 1.6.2, 2.0.0
Description
Reproducer:
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    case class E(subject: Long, predicate: String, objectNode: String)

    def test(sc: SparkContext) = {
      val sqlContext: SQLContext = new SQLContext(sc)
      import sqlContext.implicits._

      val broken = List(
        (19157170390056969L, "right", 19157170390056969L),
        (19157170390056973L, "wrong", 19157170390056971L),
        (19157190254313477L, "wrong", 19157190254313475L),
        (19157180859056133L, "wrong", 19157180859056131L),
        (19157170390056969L, "number", 161),
        (19157170390056971L, "string", "a string"),
        (19157190254313475L, "string", "another string"),
        (19157180859056131L, "number", 191)
      )

      // subject stays a Long; objectNode is stored as a String
      val brokenDF = sc.parallelize(broken).map(b => E(b._1, b._2, b._3.toString)).toDF()
      // Long column compared directly with a String column
      val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
      // Explicit cast so that the comparison is string to string
      val fixed = brokenDF.filter(brokenDF("subject").cast("string") === brokenDF("objectNode"))

      // show() and explain() print directly and return Unit, so they are not wrapped in println
      println("***** incorrect filter results *****")
      brokenFilter.show()
      println("***** correct filter results *****")
      fixed.show()
      println("***** both sides cast to double *****")
      brokenFilter.explain()
    }

Broken filter returns:

    +-----------------+---------+-----------------+
    |          subject|predicate|       objectNode|
    +-----------------+---------+-----------------+
    |19157170390056969|    right|19157170390056969|
    |19157170390056973|    wrong|19157170390056971|
    |19157190254313477|    wrong|19157190254313475|
    |19157180859056133|    wrong|19157180859056131|
    +-----------------+---------+-----------------+
The physical plan shows that both sides of the expression are cast to Double before evaluation. Comparing a numeric column to a string column that holds numbers therefore appears to work in many cases, but a Double represents every integer exactly only up to 2^53. When the values are larger than that and close together, the loss of precision is enough to produce incorrect results.
    == Physical Plan ==
    Filter (cast(subject#0L as double) = cast(objectNode#2 as double))

After casting the left side into strings, the filter returns the expected result:

    +-----------------+---------+-----------------+
    |          subject|predicate|       objectNode|
    +-----------------+---------+-----------------+
    |19157170390056969|    right|19157170390056969|
    +-----------------+---------+-----------------+
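The collision can be reproduced without Spark. The following is a minimal sketch in plain Scala, assuming that Spark's cast to double behaves like the standard `toDouble` conversions for these values. The object and method names are hypothetical and chosen only for illustration.

```scala
object DoublePrecisionDemo {
  // Mirrors the comparison in the physical plan: both operands become Double.
  // Hypothetical helper, not part of Spark.
  def equalAsDouble(subject: Long, objectNode: String): Boolean =
    subject.toDouble == objectNode.toDouble

  def main(args: Array[String]): Unit = {
    // Second row of the reproducer: the two values differ by 2.
    val subject = 19157170390056973L
    val objectNode = "19157170390056971"

    // Both values lie between 2^54 and 2^55, where adjacent doubles are 4 apart,
    // so both round to the same double and compare as equal.
    println(equalAsDouble(subject, objectNode)) // prints "true"
  }
}
```

The first row of the reproducer matches legitimately, while the three "wrong" rows match only because each pair rounds to the same double.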
Expected behavior in this case is probably to choose one side and cast the other (compare string to string or long to long) instead of using a data type with less precision.
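The two alternatives mentioned above can be sketched in plain Scala as follows. This only illustrates the comparison semantics and is not how Spark implements or resolved the coercion; the object and method names are hypothetical.

```scala
import scala.util.Try

object ExactComparisonSketch {
  // String to string: render the long as text and compare the text.
  def equalAsString(subject: Long, objectNode: String): Boolean =
    subject.toString == objectNode

  // Long to long: parse the string; anything that is not a valid long never matches.
  def equalAsLong(subject: Long, objectNode: String): Boolean =
    Try(objectNode.toLong).toOption.contains(subject)

  def main(args: Array[String]): Unit = {
    println(equalAsString(19157170390056973L, "19157170390056971")) // prints "false"
    println(equalAsLong(19157170390056973L, "19157170390056971"))   // prints "false"
    println(equalAsLong(19157170390056969L, "19157170390056969"))   // prints "true"
    println(equalAsLong(19157170390056971L, "a string"))            // prints "false"
  }
}
```

The two options are not interchangeable: for an input such as "0123" against 123, the long-to-long comparison matches while the string-to-string comparison does not, so whichever side is chosen for the cast changes the results for non-canonical strings.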
Issue Links
- duplicates
  - SPARK-19415 Improve the implicit type conversion between numeric type and string to avoid precesion loss (Resolved)
- is duplicated by
  - SPARK-19971 Wired SELECT equal behaviour. (Closed)
  - SPARK-18489 Implicit type conversion during comparision between Integer type column and String type column (Closed)
- is related to
  - SPARK-21646 Add new type coercion rules to compatible with Hive (Resolved)
- relates to
  - SPARK-25039 Binary comparison behavior should refer to Teradata (Open)
- links to