Spark / SPARK-38614

Don't push down limit through window that's using percent_rank


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.2.1, 3.3.0
    • Fix Version/s: 3.2.2, 3.3.1, 3.4.0
    • Component/s: PySpark, SQL

    Description

      The expected result is obtained with Spark 3.1.2, but not with 3.2.0, 3.2.1, or 3.3.0.

      Minimal reproducible example

      from pyspark.sql import SparkSession, functions as F, Window as W

      spark = SparkSession.builder.getOrCreate()

      # Over the full 101 rows, percent_rank over id should step from 0.00 to 1.00 by 0.01.
      df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
      df.show(3)
      df.show(5)

      Expected result

      +---+----+
      | id|  pr|
      +---+----+
      |  0| 0.0|
      |  1|0.01|
      |  2|0.02|
      +---+----+
      only showing top 3 rows
      
      +---+----+
      | id|  pr|
      +---+----+
      |  0| 0.0|
      |  1|0.01|
      |  2|0.02|
      |  3|0.03|
      |  4|0.04|
      +---+----+
      only showing top 5 rows
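
      These values follow from the definition of percent_rank, (rank - 1) / (rows in partition - 1): with all 101 rows visible to the window, the row with id = k has rank k + 1 and therefore pr = k / 100. A quick sanity check of that arithmetic:

      # For row id = k over the full 101-row frame: percent_rank = k / (101 - 1)
      n = 101
      print([k / (n - 1) for k in range(3)])  # [0.0, 0.01, 0.02]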

      Actual result

      +---+------------------+
      | id|                pr|
      +---+------------------+
      |  0|               0.0|
      |  1|0.3333333333333333|
      |  2|0.6666666666666666|
      +---+------------------+
      only showing top 3 rows
      
      +---+---+
      | id| pr|
      +---+---+
      |  0|0.0|
      |  1|0.2|
      |  2|0.4|
      |  3|0.6|
      |  4|0.8|
      +---+---+
      only showing top 5 rows
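
      The actual values show what is going wrong: df.show(n) fetches n + 1 rows so it can decide whether to print "only showing top n rows", and on the affected versions that limit is pushed below the Window, so percent_rank is computed over just those n + 1 rows. That gives (rank - 1) / 3 for show(3) and (rank - 1) / 5 for show(5), exactly the numbers above. On affected versions, one possible stopgap is to exclude the pushdown rule; this is only a sketch, and the rule's class name (taken from the 3.2 optimizer source) is an assumption to verify against your build:

      from pyspark.sql import SparkSession, functions as F, Window as W

      spark = SparkSession.builder.getOrCreate()

      # Assumed workaround on affected versions (3.2.0 - 3.3.0): exclude the
      # optimizer rule that pushes limits below windows, so percent_rank sees
      # the whole frame again. Rule name is an assumption; verify on your build.
      spark.conf.set(
          "spark.sql.optimizer.excludedRules",
          "org.apache.spark.sql.catalyst.optimizer.LimitPushDownThroughWindow",
      )

      df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
      df.show(3)  # should print pr = 0.0, 0.01, 0.02 again

      Per the issue title and fix versions, the actual fix shipped in 3.2.2, 3.3.1, and 3.4.0 instead makes the pushdown rule skip windows that use percent_rank.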


People

    Assignee: Bruce Robbins (bersprockets)
    Reporter: ZygD