Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-35563

[SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Blocker
    • Resolution: Unresolved
    • 3.0.2
    • None
    • SQL

    Description

      I think this impacts a lot more versions of Spark, but I don't know for sure because it takes a long time to test. As a part of doing corner case validation testing for spark rapids I found that if a window function has more than Int.MaxValue + 1 rows the result is silently truncated to that many rows. I have only tested this on 3.0.2 with row_number, but I suspect it will impact others as well. This is a really rare corner case, but because it is silent data corruption I personally think it is quite serious.

      import org.apache.spark.sql.expressions.Window
      
      val windowSpec = Window.partitionBy("a").orderBy("b")
      
      val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as b")
      
      spark.time(df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20))
      
      +-----+----------+                                                              
      |  dir|     count|
      +-----+----------+
      |false|2147483647|
      | true|         1|
      +-----+----------+
      
      Time taken: 1139089 ms
      
      Int.MaxValue.toLong + 100
      res15: Long = 2147483747
      
      2147483647L + 1
      res16: Long = 2147483648
      

      I had to make sure that I ran the above with at least 64GiB of heap for the executor (I did it in local mode and it worked, but took forever to run)

      Attachments

        Activity

          People

            Unassigned Unassigned
            revans2 Robert Joseph Evans
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated: