  1. Spark
  2. SPARK-35563

[SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: SQL


      I think this impacts a lot more versions of Spark, but I don't know for sure because it takes a long time to test. As part of corner-case validation testing for Spark RAPIDS, I found that if a window function processes more than Int.MaxValue + 1 rows, the result is silently truncated to Int.MaxValue + 1 rows. I have only tested this on 3.0.2 with row_number, but I suspect it impacts other window functions as well. This is a really rare corner case, but because it is silent data corruption I personally think it is quite serious.

      import org.apache.spark.sql.expressions.Window
      val windowSpec = Window.partitionBy("a").orderBy("b")
      val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as b")
      spark.time(df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20))
      |  dir|     count|
      | true|         1|
      Time taken: 1139089 ms
      Int.MaxValue.toLong + 100
      res15: Long = 2147483747
      2147483647L + 1
      res16: Long = 2147483648

      I had to run the above with at least 64 GiB of heap for the executor. I ran it in local mode and it worked, but it took a very long time to finish.
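
      The arithmetic behind the report can be checked without Spark. The sketch below is plain Scala and only restates the numbers from the repro above: the requested input size, the row count the description says the output is truncated to, and what a 32-bit counter does when it passes Int.MaxValue. It assumes that row_number is an Int column (which is what the `rn < 0` check in the repro relies on); the object and value names are hypothetical and not part of Spark.

```scala
// Plain Scala arithmetic only; no Spark session is needed to run this.
// All names here are hypothetical and exist only for illustration.
object RowNumberOverflowSketch {
  // Rows requested by the repro: spark.range(Int.MaxValue.toLong + 100)
  val inputRows: Long = Int.MaxValue.toLong + 100

  // Row count the description says the window output is truncated to.
  val reportedOutputRows: Long = Int.MaxValue.toLong + 1

  // Rows that would be silently missing under that reading of the report.
  val missingRows: Long = inputRows - reportedOutputRows

  // Assumption: row_number is a 32-bit Int. An Int wraps to Int.MinValue
  // when incremented past Int.MaxValue, so a negative rn signals overflow.
  val wrapped: Int = Int.MaxValue + 1

  def main(args: Array[String]): Unit = {
    println(s"input rows           = $inputRows")          // 2147483747
    println(s"reported output rows = $reportedOutputRows") // 2147483648
    println(s"missing rows         = $missingRows")        // 99
    println(s"Int.MaxValue + 1     = $wrapped")            // -2147483648
  }
}
```

      Under these assumptions, 99 of the 2147483747 input rows would be lost, and a single negative row number is consistent with the one `true` row in the output above.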




            Assignee: Unassigned
            Reporter: Robert Joseph Evans (revans2)