[SPARK-18111] Wrong ApproximatePercentile answer when multiple records have the minimum value - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.1
Fix Version/s: 2.0.3, 2.1.0
Component/s: SQL
Labels: None

Description

When multiple records have the minimum value, the answer of ApproximatePercentile is wrong.

Suppose we have a table with 12 records in 4 partitions, where the values of column "col" in each partition are:
1, 1, 2
1, 1, 3
1, 1, 4
1, 1, 5
If we query percentile_approx(col, array(0.5)), the current answer is "5", which is far from the correct answer "1".

The test case is as below:

  test("percentile_approx, multiple records with the minimum value in a partition") {
    withTempView(table) {
      // 12 values in 4 slices, intended as (1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 1, 5)
      spark.sparkContext.makeRDD(Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5), 4).toDF("col")
        .createOrReplaceTempView(table)
      // 8 of the 12 values are 1, so the 0.5 percentile should be 1.0
      checkAnswer(
        spark.sql(s"SELECT percentile_approx(col, array(0.5)) FROM $table"),
        Row(Seq(1.0D))
      )
    }
  }
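
As a sanity check on the expected value in the test above, here is a minimal sketch that computes the exact 0.5 percentile of the same 12 values in plain Scala, without Spark. The object `ExactPercentileCheck` and the helper `exactPercentile` are hypothetical names introduced for illustration; they use a simple rank-based definition and are not how ApproximatePercentile is implemented.

```scala
// Exact (non-approximate) percentile over the 12 values from the description.
// `ExactPercentileCheck` and `exactPercentile` are hypothetical, not part of Spark.
object ExactPercentileCheck {
  // The four partitions listed in the description.
  val partitions: Seq[Seq[Int]] = Seq(
    Seq(1, 1, 2),
    Seq(1, 1, 3),
    Seq(1, 1, 4),
    Seq(1, 1, 5)
  )

  // Rank-based percentile: the value at rank ceil(p * n) in sorted order (1-based).
  def exactPercentile(values: Seq[Int], p: Double): Int = {
    require(values.nonEmpty && p >= 0.0 && p <= 1.0)
    val sorted = values.sorted
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }

  def main(args: Array[String]): Unit = {
    val all = partitions.flatten
    println(s"records equal to the minimum: ${all.count(_ == all.min)} of ${all.length}")
    println(s"exact 0.5 percentile: ${exactPercentile(all, 0.5)}")
  }
}
```

Eight of the twelve records equal the minimum value 1, so the rank-6 element of the sorted data is 1, which is why the test expects `1.0` rather than `5`.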

Issue Links

is duplicated by

SPARK-18221 Wrong ApproximatePercentile answer when multiple records have the minimum value (for branch 2.0) (Resolved)

links to

[GitHub] Pull Request #15641 (wzhfy)

[GitHub] Pull Request #15732 (wzhfy)

People

Assignee: Zhenhua Wang

Reporter: Zhenhua Wang

Votes: 0

Watchers: 2

Dates

Created: 26/Oct/16 06:57

Updated: 02/Nov/16 18:47

Resolved: 01/Nov/16 13:13