Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-29336

The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 2.4.3
    • 3.0.0
    • SQL

    Description

      Hello Spark maintainers,

      I was experimenting with my own implementation of the space-efficient quantile algorithm in another language and I was using the Spark's one as a reference.

      In my analysis, I believe to have found an issue with the merge() logic. Here is some simple Scala code that reproduces the issue I've found:

       

      var values = (1 to 100).toArray
      val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray
      for (n <- 0 until 5) {
        var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
        val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
        val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
        val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray
        val max_error = error.max
        print(max_error + "\n")
      }
      

      I query for all possible quantiles in a 100-element array with a desired 10% max error. In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of 100). However, the output I observe is:

       

      16
      12
      10
      11
      17

      The variance is probably due to non-deterministic operations behind the scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it)

      Interestingly enough, if I change from five to one partition the code works as expected and gives 10 every time. This seems to point to some problem at the merge logic

      The original authors (clockfly and cloud_fan for what I could dig from the history) suggest the published paper is not clear on how that should be done and, honestly, I was not confident in the current approach either.

      I've found SPARK-21184 that reports the same problem, but it was unfortunately closed with no fix applied.

      In my external implementation I believe to have found a sound way to implement the merge method. Here is my take in Rust, if relevant
      I'd be really glad to add unit tests and contribute my implementation adapted to Scala.
      I'd love to hear your opinion on the matter.
      Best regards

       

       

      Attachments

        Issue Links

          Activity

            People

              sitegui Guilherme Souza
              sitegui Guilherme Souza
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: