Description
When multiple records share the minimum value, ApproximatePercentile returns a wrong answer.
Suppose we have a table with 12 records in 4 partitions, where the values of column "col" in each partition are:
1, 1, 2
1, 1, 3
1, 1, 4
1, 1, 5
If we query percentile_approx(col, array(0.5)), the current answer is "5", which is far from the correct answer "1".
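To make the expected value concrete, here is a minimal sketch that computes the exact 0.5 percentile of the 12 values without Spark. It assumes a simple nearest-rank definition (the value at 1-based rank ceil(p * n) in sorted order); the object and method names are hypothetical, and this is not how ApproximatePercentile works internally.

```scala
object ExactPercentile {
  // Nearest-rank percentile (an assumed definition, for illustration only):
  // the value at 1-based rank ceil(p * n) in the sorted input.
  def nearestRank(values: Seq[Int], p: Double): Int = {
    require(values.nonEmpty, "values must be non-empty")
    require(p > 0.0 && p <= 1.0, "p must be in (0, 1]")
    val sorted = values.sorted
    val rank = math.ceil(p * sorted.length).toInt
    sorted(rank - 1)
  }

  // The four partitions from the report above.
  val partitions: Seq[Seq[Int]] = Seq(
    Seq(1, 1, 2),
    Seq(1, 1, 3),
    Seq(1, 1, 4),
    Seq(1, 1, 5)
  )

  // Exact median over all 12 values, ignoring partition boundaries.
  val median: Int = nearestRank(partitions.flatten, 0.5)
}
```

Eight of the 12 values are 1, so any value at or below rank 8 in sorted order is 1. An approximate answer of 5, the maximum, is therefore outside any reasonable error bound.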
The test case is as below:
test("percentile_approx, multiple records with the minimum value in a partition") {
  withTempView(table) {
    spark.sparkContext.makeRDD(Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5), 4).toDF("col")
      .createOrReplaceTempView(table)
    checkAnswer(
      spark.sql(s"SELECT percentile_approx(col, array(0.5)) FROM $table"),
      Row(Seq(1.0D))
    )
  }
}
Issue Links
- is duplicated by SPARK-18221 Wrong ApproximatePercentile answer when multiple records have the minimum value (for branch 2.0) - Resolved
- links to