Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32908

percentile_approx() returns incorrect results

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.3.4, 2.4.7, 3.0.1, 3.1.0
    • 2.4.8, 3.0.2, 3.1.0
    • SQL

    Description

      Read input data from the attached CSV file:

            val df = spark.read.option("header", "true")
              .option("inferSchema", "true")
              .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
              .repartition(1)
            df.createOrReplaceTempView(table)
      

      Calculate the 0.77 percentile with accuracy 1e-05:

            spark.sql(
              s"""SELECT
                 |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
                 |FROM $table
                 """.stripMargin).show
      
      +------------------------------------------------------------------------+
      |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
      +------------------------------------------------------------------------+
      |                                                                    1000|
      +------------------------------------------------------------------------+
      

       The same for smaller accuracy 0.001:

      +----------------------------------------------------------------------+
      |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
      +----------------------------------------------------------------------+
      |                                                                    18|
      +----------------------------------------------------------------------+
      

      and better accuracy 1e-06:

      +-------------------------------------------------------------------------+
      |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
      +-------------------------------------------------------------------------+
      |                                                                       17|
      +-------------------------------------------------------------------------+
      

      For the accuracy 1e-05, the result must be around 17-18 but not 1000.

      Here is percentile calculation in Google Sheets for the same input:
      https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing

      Attachments

        1. percentile_approx-input.csv
          380 kB
          Max Gekk

        Issue Links

          Activity

            People

              maxgekk Max Gekk
              maxgekk Max Gekk
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: