Spark / SPARK-18111

Wrong ApproximatePercentile answer when multiple records have the minimum value

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.3, 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      ApproximatePercentile returns a wrong answer when multiple records share the minimum value.

      Suppose a table has 12 records spread over 4 partitions, and the values of column "col" in each partition are:
      1, 1, 2
      1, 1, 3
      1, 1, 4
      1, 1, 5
      Querying percentile_approx(col, array(0.5)) currently returns "5", which is far from the correct answer "1": 8 of the 12 values are 1, so the median is 1.
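
      As a sanity check on the expected answer, the exact median of these 12 values can be computed in plain Scala, without Spark. This is only a sketch: the object ExactPercentileCheck and the helper exactPercentile are hypothetical names introduced for illustration, and the nearest-rank definition used here is an assumption, not necessarily the definition ApproximatePercentile approximates. For this data the choice does not matter, since both middle elements are 1.

```scala
// Exact percentile over the 12 values from the description, without Spark.
// `ExactPercentileCheck` and `exactPercentile` are hypothetical names, not Spark APIs.
object ExactPercentileCheck {
  // Nearest-rank definition (an assumption): the value at rank ceil(p * n), 1-based.
  def exactPercentile(values: Seq[Int], p: Double): Int = {
    require(values.nonEmpty && p >= 0.0 && p <= 1.0)
    val sorted = values.sorted
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }

  // The four partitions listed in the description.
  val partitions: Seq[Seq[Int]] =
    Seq(Seq(1, 1, 2), Seq(1, 1, 3), Seq(1, 1, 4), Seq(1, 1, 5))

  // All 12 records, as the aggregate sees them after merging partitions.
  val all: Seq[Int] = partitions.flatten

  def main(args: Array[String]): Unit = {
    // Sorted: 1,1,1,1,1,1,1,1,2,3,4,5; rank 6 of 12 is the value 1.
    println(exactPercentile(all, 0.5)) // prints 1
  }
}
```

      Any result other than 1 for the 0.5 percentile is therefore an error in the approximation, not a matter of interpolation.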

      The following test case covers this scenario:

        test("percentile_approx, multiple records with the minimum value in a partition") {
          withTempView(table) {
            spark.sparkContext.makeRDD(Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5), 4).toDF("col")
              .createOrReplaceTempView(table)
            checkAnswer(
              spark.sql(s"SELECT percentile_approx(col, array(0.5)) FROM $table"),
              Row(Seq(1.0D))
            )
          }
        }
      

            People

              Assignee: Zhenhua Wang (ZenWzh)
              Reporter: Zhenhua Wang (ZenWzh)