SPARK-32110

-0.0 vs 0.0 is inconsistent


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.2, 3.1.0
    • Component/s: SQL
    • Labels:
      None

      Description

      This is related to SPARK-26021 where some things were fixed but there is still a lot that is not consistent.

      When parsing SQL, -0.0 is turned into 0.0. This can produce results that appear to be correct at first glance but are inconsistent: the same operator behaves differently depending on whether -0.0 came from a SQL literal or from data.

      scala> import spark.implicits._
      import spark.implicits._
      
      scala> spark.sql("SELECT 0.0 = -0.0").collect
      res0: Array[org.apache.spark.sql.Row] = Array([true])
      
      scala> Seq((0.0, -0.0)).toDF("a", "b").selectExpr("a = b").collect
      res1: Array[org.apache.spark.sql.Row] = Array([false])
      

      This inconsistency also shows up in sorts:

      scala> Seq((0.0, -100.0), (-0.0, 100.0), (0.0, 100.0), (-0.0, -100.0)).toDF("a", "b").orderBy("a", "b").collect
      res2: Array[org.apache.spark.sql.Row] = Array([-0.0,-100.0], [-0.0,100.0], [0.0,-100.0], [0.0,100.0])
      

      But it does not show up in an equi-join or an aggregate:

      scala> Seq((0.0, -0.0)).toDF("a", "b").join(Seq((-0.0, 0.0)).toDF("r_a", "r_b"), $"a" === $"r_a").collect
      res3: Array[org.apache.spark.sql.Row] = Array([0.0,-0.0,-0.0,0.0])
      
      scala> Seq((0.0, 1.0), (-0.0, 1.0)).toDF("a", "b").groupBy("a").count.collect
      res6: Array[org.apache.spark.sql.Row] = Array([0.0,2])
      

      This can lead to some very odd results, like an equi-join with a filter that logically should do nothing but ends up filtering the result down to nothing.

      scala> Seq((0.0, -0.0)).toDF("a", "b").join(Seq((-0.0, 0.0)).toDF("r_a", "r_b"), $"a" === $"r_a" && $"a" <= $"r_a").collect
      res8: Array[org.apache.spark.sql.Row] = Array()
      
      scala> Seq((0.0, -0.0)).toDF("a", "b").join(Seq((-0.0, 0.0)).toDF("r_a", "r_b"), $"a" === $"r_a").collect
      res9: Array[org.apache.spark.sql.Row] = Array([0.0,-0.0,-0.0,0.0])
      

      Hive never normalizes -0.0 to 0.0, so this results in non-IEEE-754-compliant behavior everywhere, but at least it is consistently odd.

      MySQL, Oracle, Postgres, and SQLite all appear to normalize -0.0 to 0.0.
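The normalizing behavior those engines exhibit can be sketched in plain Java by mapping -0.0 to the canonical positive zero before any comparison or grouping. This is only an illustration of the idea, not Spark's actual code:

```java
// Sketch of floating-point normalization before comparison/grouping.
// Because -0.0 == 0.0 is true for primitives, the conditional below maps
// -0.0 to the canonical positive zero (and canonicalizes NaN for good measure).
public class Normalize {
    static double normalize(double d) {
        if (Double.isNaN(d)) return Double.NaN; // one canonical NaN bit pattern
        return d == 0.0 ? 0.0 : d;              // maps -0.0 to 0.0
    }

    public static void main(String[] args) {
        // After normalization, Double.compare agrees with the primitive ==:
        System.out.println(Double.compare(normalize(-0.0), normalize(0.0))); // 0
    }
}
```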

      The root cause of this appears to be that the OpenJDK implementations of Double.compare and Float.compare place -0.0 < 0.0.

      This is not documented in the Javadoc, but it is clearly documented in the code, so it is not a "bug" that Java is going to fix:

      https://github.com/openjdk/jdk/blob/a0a0539b0d3f9b6809c9759e697bfafd7b138ec1/src/java.base/share/classes/java/lang/Double.java#L1022-L1035

      It is also consistent with the Javadoc for Double.equals:
      https://docs.oracle.com/javase/8/docs/api/java/lang/Double.html#equals-java.lang.Object-
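The whole mismatch can be reproduced in plain Java, without Spark: the primitive IEEE 754 comparison says the zeros are equal, while the total ordering used by Double.compare and the bit-wise Double.equals say they are not:

```java
// Plain-Java reproduction of the mismatch described above: the primitive
// IEEE 754 comparison treats -0.0 and 0.0 as equal, while Double.compare
// uses a total ordering in which -0.0 < 0.0, and Double.equals compares
// bit patterns, which differ in the sign bit.
public class SignedZeroCompare {
    public static void main(String[] args) {
        System.out.println(-0.0 == 0.0);                      // true  (IEEE 754)
        System.out.println(Double.compare(-0.0, 0.0));        // -1    (total order)
        System.out.println(Double.valueOf(-0.0).equals(0.0)); // false (bit-wise)
    }
}
```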

      To be clear, I am filing this mostly to document the current state, not because I think it needs to be fixed ASAP. It is a rare corner case, but it was really frustrating to debug what was happening.

    People

    • Assignee: Wenchen Fan (cloud_fan)
    • Reporter: Robert Joseph Evans (revans2)