Spark / SPARK-38614

Don't push down limit through window that's using percent_rank


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.2.1, 3.3.0
    • Fix Version/s: 3.2.2, 3.3.1, 3.4.0
    • Component/s: PySpark, SQL

    Description

      The expected result is obtained with Spark 3.1.2, but not with 3.2.0, 3.2.1, or 3.3.0.

      Minimal reproducible example

      from pyspark.sql import SparkSession, functions as F, Window as W

      spark = SparkSession.builder.getOrCreate()

      # Over the full 101 rows, percent_rank over id should step from 0.00 to 1.00 by 0.01.
      df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
      df.show(3)
      df.show(5)

      Expected result

      +---+----+
      | id|  pr|
      +---+----+
      |  0| 0.0|
      |  1|0.01|
      |  2|0.02|
      +---+----+
      only showing top 3 rows
      
      +---+----+
      | id|  pr|
      +---+----+
      |  0| 0.0|
      |  1|0.01|
      |  2|0.02|
      |  3|0.03|
      |  4|0.04|
      +---+----+
      only showing top 5 rows
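
      These values follow from the definition of percent_rank, (rank - 1) / (rows in partition - 1): with all 101 rows visible to the window, the row with id = k has rank k + 1 and therefore pr = k / 100. A quick sanity check of that arithmetic:

      # For row id = k over the full 101-row frame: percent_rank = k / (101 - 1)
      n = 101
      print([k / (n - 1) for k in range(3)])  # [0.0, 0.01, 0.02]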

      Actual result

      +---+------------------+
      | id|                pr|
      +---+------------------+
      |  0|               0.0|
      |  1|0.3333333333333333|
      |  2|0.6666666666666666|
      +---+------------------+
      only showing top 3 rows
      
      +---+---+
      | id| pr|
      +---+---+
      |  0|0.0|
      |  1|0.2|
      |  2|0.4|
      |  3|0.6|
      |  4|0.8|
      +---+---+
      only showing top 5 rows
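
      The actual values show what is going wrong: df.show(n) fetches n + 1 rows so it can decide whether to print "only showing top n rows", and on the affected versions that limit is pushed below the Window, so percent_rank is computed over just those n + 1 rows. That gives (rank - 1) / 3 for show(3) and (rank - 1) / 5 for show(5), exactly the numbers above. On affected versions, one possible stopgap is to exclude the pushdown rule; this is only a sketch, and the rule's class name (taken from the 3.2 optimizer source) is an assumption to verify against your build:

      from pyspark.sql import SparkSession, functions as F, Window as W

      spark = SparkSession.builder.getOrCreate()

      # Assumed workaround on affected versions (3.2.0 - 3.3.0): exclude the
      # optimizer rule that pushes limits below windows, so percent_rank sees
      # the whole frame again. Rule name is an assumption; verify on your build.
      spark.conf.set(
          "spark.sql.optimizer.excludedRules",
          "org.apache.spark.sql.catalyst.optimizer.LimitPushDownThroughWindow",
      )

      df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
      df.show(3)  # should print pr = 0.0, 0.01, 0.02 again

      Per the issue title and fix versions, the actual fix shipped in 3.2.2, 3.3.1, and 3.4.0 instead makes the pushdown rule skip windows that use percent_rank.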


People

    Assignee: Bruce Robbins (bersprockets)
    Reporter: ZygD