[SPARK-21184] QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.1.1
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

1. QuantileSummaries implementation does not match the paper it is supposed to be based on.

1a. The compress method (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240) merges neighboring buckets, but thats not what the paper says to do. The paper (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) describes an implicit tree structure and the compress method deletes selected subtrees.

1b. The paper does not discuss merging these summary data structures at all. The following comment is in the merge method of QuantileSummaries:

// The GK algorithm is a bit unclear about it, but it seems there is no need to adjust the
// statistics during the merging: the invariants are still respected after the merge.

Unless I'm missing something that needs substantiation, it's not clear that that the invariants hold.

2. QuantileSummariesSuite fails with n = 10000 (and other non trivial values)
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27

One possible solution if these issues can't be resolved would be to move to an algorithm that explicitly supports merging and is well tested like https://github.com/tdunning/t-digest

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Andrew Ray

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 22/Jun/17 20:33

Updated:: 21/May/19 04:17

Resolved:: 21/May/19 04:17