[SPARK-29336] The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.3
Fix Version/s: 3.0.0
Component/s: SQL
Labels:
- pull-request-available

Description

Hello Spark maintainers,

I was experimenting with my own implementation of the space-efficient quantile algorithm in another language and I was using the Spark's one as a reference.

In my analysis, I believe to have found an issue with the merge() logic. Here is some simple Scala code that reproduces the issue I've found:

var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray
for (n <- 0 until 5) {
  var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}

I query for all possible quantiles in a 100-element array with a desired 10% max error. In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of 100). However, the output I observe is:

The variance is probably due to non-deterministic operations behind the scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it)

Interestingly enough, if I change from five to one partition the code works as expected and gives 10 every time. This seems to point to some problem at the merge logic

The original authors (clockfly and cloud_fan for what I could dig from the history) suggest the published paper is not clear on how that should be done and, honestly, I was not confident in the current approach either.

I've found ~~SPARK-21184~~ that reports the same problem, but it was unfortunately closed with no fix applied.

In my external implementation I believe to have found a sound way to implement the merge method. Here is my take in Rust, if relevant
I'd be really glad to add unit tests and contribute my implementation adapted to Scala.
I'd love to hear your opinion on the matter.
Best regards

Attachments

Issue Links

is duplicated by

SPARK-29325 approxQuantile() results are incorrect and vary significantly for small changes in relativeError

Resolved

links to

GitHub Pull Request #26029

Activity

People

Assignee:: Guilherme Souza

Reporter:: Guilherme Souza

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 02/Oct/19 20:31

Updated:: 4 days ago 13:22

Resolved:: 08/Oct/19 13:12