  1. Spark
  2. SPARK-18111

Wrong ApproximatePercentile answer when multiple records have the minimum value


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.3, 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      When multiple records have the minimum value, ApproximatePercentile returns a wrong answer.

      Suppose we have a table with 12 records in 4 partitions, where the values of column "col" in each partition are:
      1, 1, 2
      1, 1, 3
      1, 1, 4
      1, 1, 5
      If we query percentile_approx(col, array(0.5)), the current answer is "5", which is far from the correct answer "1" (8 of the 12 values are 1, so the median is 1).
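
As a sanity check on the expected value, here is a minimal sketch in plain Scala (no Spark) that computes the exact 0.5 percentile of the 12 values listed above. The object name `ExactPercentileCheck`, the helper `exactPercentile`, and the nearest-rank definition it uses are assumptions made for illustration; they are not Spark's ApproximatePercentile implementation.

```scala
// Reference calculation only: an exact percentile over the example data.
// ExactPercentileCheck and exactPercentile are hypothetical names, not Spark APIs.
object ExactPercentileCheck {
  // Partition contents copied from the description.
  val partitions: Seq[Seq[Int]] = Seq(
    Seq(1, 1, 2),
    Seq(1, 1, 3),
    Seq(1, 1, 4),
    Seq(1, 1, 5)
  )

  // Nearest-rank percentile (an assumed definition): the value at
  // 1-based rank ceil(p * n) in sorted order, with rank at least 1.
  def exactPercentile(values: Seq[Int], p: Double): Int = {
    require(values.nonEmpty && p >= 0.0 && p <= 1.0)
    val sorted = values.sorted
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }

  def main(args: Array[String]): Unit = {
    val all = partitions.flatten
    println(s"count of 1s: ${all.count(_ == 1)} of ${all.length}") // 8 of 12
    println(s"exact median: ${exactPercentile(all, 0.5)}")         // 1
  }
}
```

Because the eight smallest values are all 1, any usual percentile definition gives 1 at 0.5 for this data, so an approximate result of 5 cannot be explained by approximation error alone.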

      The following test case reproduces the problem:

        test("percentile_approx, multiple records with the minimum value in a partition") {
          withTempView(table) {
            spark.sparkContext.makeRDD(Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5), 4).toDF("col")
              .createOrReplaceTempView(table)
            checkAnswer(
              spark.sql(s"SELECT percentile_approx(col, array(0.5)) FROM $table"),
              Row(Seq(1.0D))
            )
          }
        }
      


          People

            Assignee: Zhenhua Wang (ZenWzh)
            Reporter: Zhenhua Wang (ZenWzh)
            Votes: 0
            Watchers: 2
