Spark / SPARK-29325

approxQuantile() results are incorrect and vary significantly for small changes in relativeError


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.2, 2.4.4
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: OSX 10.14.6, Scala 2.11.12, Spark 2.4.4. The bug was also verified with Scala 2.11.8 and Spark 2.3.2.

    Description

      The approxQuantile() method sometimes returns incorrect results that are highly sensitive to the choice of relativeError.

      Below is an example in the latest Spark version (2.4.4). The 0.9 quantile jumps between roughly 0.592 and 0.676 for modest changes in the relativeError parameter, far more than the specified relativeError should allow.

       

      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
            /_/
               
      Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      
      scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
      df: org.apache.spark.sql.DataFrame = [value: double]
      
      
      scala> df.stat.approxQuantile("value", Array(0.9), 0)
      res0: Array[Double] = Array(0.5929591082174609)
      
      
      scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
      res1: Array[Double] = Array(0.67621027121925)
      
      
      scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
      res2: Array[Double] = Array(0.5926195654486178)
      
      
      scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
      res3: Array[Double] = Array(0.5924693999048418)
      
      
      scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
      res4: Array[Double] = Array(0.67621027121925)
      
      
      scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
      res5: Array[Double] = Array(0.5923925937051544) 
      

      I attached a zip file containing the data used for the above example demonstrating the bug.
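
      To make "more than would be expected" concrete: the approxQuantile() documentation states that for N rows, probability p and error err, the returned value x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal sketch of that check in plain Scala on an in-memory sequence. The object and method names are hypothetical, and the rank convention (number of elements <= x) is an assumption of this sketch.

```scala
object RankBoundCheck {
  // Rank of x: number of elements <= x (assumed convention for this sketch).
  def rank(data: Seq[Double], x: Double): Long = data.count(_ <= x).toLong

  // True if x is an acceptable answer for probability p with relative error err,
  // i.e. floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
  def withinRankBound(data: Seq[Double], p: Double, err: Double, x: Double): Boolean = {
    val n = data.size
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    val r = rank(data, x)
    lo <= r && r <= hi
  }
}
```

      Applied to the values collected from the attached file, a call such as RankBoundCheck.withinRankBound(values, 0.9, 0.001, 0.67621027121925) would indicate whether a given result violates the bound, where values is a hypothetical Seq[Double] holding the "value" column.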

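      The following demonstrates that the data contains values between the 0.5926195654486178 and 0.67621027121925 results observed above: the exact (relativeError = 0.0) quantiles from 0.9 through 0.99 all fall inside that gap, so 0.67621027121925 lies above even the exact 0.99 quantile.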

      scala> df.stat.approxQuantile("value", Array(0.9), 0.0)
      res10: Array[Double] = Array(0.5929591082174609)
      
      scala> df.stat.approxQuantile("value", Array(0.91), 0.0)
      res11: Array[Double] = Array(0.5966354540849995)
      
      scala> df.stat.approxQuantile("value", Array(0.92), 0.0)
      res12: Array[Double] = Array(0.6015676591185595)
      
      scala> df.stat.approxQuantile("value", Array(0.93), 0.0)
      res13: Array[Double] = Array(0.6029240823799614)
      
      scala> df.stat.approxQuantile("value", Array(0.94), 0.0)
      res14: Array[Double] = Array(0.6117645471000034)
      
      scala> df.stat.approxQuantile("value", Array(0.95), 0.0)
      res15: Array[Double] = Array(0.6185162204274052)
      
      scala> df.stat.approxQuantile("value", Array(0.96), 0.0)
      res16: Array[Double] = Array(0.625983000807062)
      
      scala> df.stat.approxQuantile("value", Array(0.97), 0.0)
      res17: Array[Double] = Array(0.6306892943412258)
      
      scala> df.stat.approxQuantile("value", Array(0.98), 0.0)
      res18: Array[Double] = Array(0.6365567375994333)
      
      scala> df.stat.approxQuantile("value", Array(0.99), 0.0)
      res19: Array[Double] = Array(0.6554479197566019)
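
      The exact quantiles above can be used to bracket any approximate result. The sketch below copies those values into a small lookup and reports the highest listed probability whose exact quantile is at or below a given value. The object and method names are hypothetical, and it assumes the relativeError = 0.0 results are themselves correct.

```scala
object QuantileBracket {
  // Exact (relativeError = 0.0) quantiles copied from the session above,
  // as (probability, value) pairs in ascending order.
  val exact: Seq[(Double, Double)] = Seq(
    0.90 -> 0.5929591082174609,
    0.91 -> 0.5966354540849995,
    0.92 -> 0.6015676591185595,
    0.93 -> 0.6029240823799614,
    0.94 -> 0.6117645471000034,
    0.95 -> 0.6185162204274052,
    0.96 -> 0.625983000807062,
    0.97 -> 0.6306892943412258,
    0.98 -> 0.6365567375994333,
    0.99 -> 0.6554479197566019
  )

  // Largest listed probability whose exact quantile is <= x, if any.
  def highestQuantileAtOrBelow(x: Double): Option[Double] =
    exact.filter { case (_, q) => q <= x }.map(_._1).lastOption
}
```

      For 0.67621027121925 this returns Some(0.99), meaning the value returned for the 0.9 quantile with relativeError 0.001 sits at or above the exact 0.99 quantile. For 0.5926195654486178 it returns None, since that value is just below the exact 0.9 quantile.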
      

          People

            Assignee: Unassigned
            Reporter: James Verbus (jverbus)
            Votes: 0
            Watchers: 2
