[SPARK-13600] Use approxQuantile from DataFrame stats in QuantileDiscretizer - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0, 2.0.0
Fix Version/s: 2.0.0
Component/s: MLlib
Labels: None

Description

For consistency and code reuse, QuantileDiscretizer should use approxQuantile to find splits in the data rather than implement its own method.

Additionally, making this change should remedy a bug where QuantileDiscretizer fails to calculate the correct splits in certain circumstances, resulting in an incorrect number of buckets/bins.

For example:

val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
discretizer.fit(df).getSplits

gives:
Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
which corresponds to 6 buckets (not 5).
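
To make the expected result concrete, here is a minimal sketch of how bucket boundaries relate to quantiles. It is not Spark's implementation: SplitSketch and its methods are hypothetical names, and an exact quantile on a local sorted array stands in for approxQuantile. For numBuckets buckets it takes the numBuckets - 1 interior quantiles at i/numBuckets and adds -Infinity and Infinity at the ends, so 5 buckets should yield 6 boundaries rather than the 7 shown above.

```scala
// Hypothetical helper for illustration only; not the QuantileDiscretizer code.
object SplitSketch {
  // Exact quantile at probability i / numBuckets on sorted data, used here
  // as a stand-in for an approximate quantile computation.
  // The rank is ceil(i * n / numBuckets), computed in integer arithmetic
  // to avoid floating-point rounding.
  private def quantileAt(sorted: Array[Double], i: Int, numBuckets: Int): Double = {
    val rank = (i * sorted.length + numBuckets - 1) / numBuckets
    sorted(rank - 1)
  }

  // Returns the bucket boundaries: -Infinity, the distinct interior
  // quantiles, then Infinity. With no repeated quantiles this gives
  // numBuckets + 1 boundaries, i.e. exactly numBuckets buckets.
  def splits(data: Seq[Double], numBuckets: Int): Array[Double] = {
    require(numBuckets >= 2, "need at least two buckets")
    require(data.nonEmpty, "need at least one data point")
    val sorted = data.toArray
    java.util.Arrays.sort(sorted)
    val interior = (1 until numBuckets).map(quantileAt(sorted, _, numBuckets)).toArray.distinct
    Array(Double.NegativeInfinity) ++ interior ++ Array(Double.PositiveInfinity)
  }
}
```

Calling SplitSketch.splits((1 to 10).map(_.toDouble), 5) on the same data as the example returns 6 boundaries with interior splits 2.0, 4.0, 6.0, 8.0. The maximum value 10.0 is not emitted as a split, which is the difference from the buggy output.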

Issue Links

relates to

SPARK-10785 Scale QuantileDiscretizer using distributed binning

Closed

supersedes

SPARK-10785 Scale QuantileDiscretizer using distributed binning

Closed

links to

[Github] Pull Request #11553 (oliverpierson)

People

Assignee: Oliver Pierson

Reporter: Oliver Pierson

Shepherd: Nicholas Pentreath

Votes: 0

Watchers: 6

Dates

Created: 01/Mar/16 18:46

Updated: 11/Apr/16 19:03

Resolved: 11/Apr/16 19:03