Hello Spark maintainers,
I was experimenting with my own implementation of the space-efficient quantile algorithm in another language, using Spark's implementation as a reference.
In my analysis, I believe I have found an issue with the merge() logic. Some simple Scala code reproduces the issue I've found; a sketch of it follows.
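The snippet below is a minimal sketch of that reproduction rather than the exact code I ran: the object name, the local master, and the exact way the rank error is computed are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object ApproxQuantileMergeRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("approx-quantile-merge-repro")
      .getOrCreate()
    import spark.implicits._

    val n = 100
    val data = (1 to n).map(_.toDouble)

    // Five partitions: each partition builds its own summary and the results
    // are merged, which is the code path under suspicion.
    val df = spark.sparkContext.parallelize(data, numSlices = 5).toDF("value")

    // All "possible" quantiles: 1/100, 2/100, ..., 100/100.
    val probabilities = (1 to n).map(_.toDouble / n).toArray
    val relativeError = 0.1 // 10% of 100 elements => at most 10 ranks of error

    val quantiles = df.stat.approxQuantile("value", probabilities, relativeError)

    // The data is just 1..100, so the rank of a returned value equals the value
    // itself, and the rank error is |returned value - target rank|.
    val maxRankError = probabilities.zip(quantiles).map { case (p, q) =>
      math.abs(q - p * n)
    }.max

    println(s"max rank error = $maxRankError")

    spark.stop()
  }
}
```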
The code queries all possible quantiles of a 100-element array with a desired 10% maximum error. In this scenario one would expect a maximum error of 10 ranks or less (10% of 100). However, with five partitions the maximum rank error I observe is consistently larger than that, and the exact value changes from run to run.
The variance is probably due to non-deterministic operations behind the scenes, but it is irrelevant to the root cause. (And sorry for my Scala, I'm not used to it.)
Interestingly enough, if I change from five partitions to one, the code works as expected and reports a maximum error of 10 every time (see the fragment below). This seems to point to a problem in the merge logic.
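For reference, the single-partition variant only differs in the parallelism; this is a fragment continuing the sketch above, not standalone code:

```scala
// Same pipeline, but with a single partition: the per-partition summaries never
// need to be combined across partitions, so the merge path is avoided (or at
// most trivially exercised), and the reported max rank error stays within 10.
val singlePartitionDf = spark.sparkContext.parallelize(data, numSlices = 1).toDF("value")
val singleQuantiles = singlePartitionDf.stat.approxQuantile("value", probabilities, relativeError)
```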
The original authors (Sean Zhong and Wenchen Fan, from what I could dig up in the history) suggest that the published paper is not clear on how the merge should be done and, honestly, I am not confident in the current approach either.
There is also SPARK-21184, which reports the same problem, but it was unfortunately closed with no fix applied.
In my external implementation, I believe I have found a sound way to implement the merge method; here is my take in Rust, in case it is relevant.
I'd be really glad to add unit tests and to contribute my implementation adapted to Scala.
I'd love to hear your opinion on the matter.