[SPARK-13444] QuantileDiscretizer chooses bad splits on large DataFrames - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0, 2.0.0
Fix Version/s: 1.6.2, 2.0.0
Component/s: MLlib
Labels: None

Description

In certain circumstances, QuantileDiscretizer fails to calculate the correct splits and will instead split data into two bins regardless of the value specified in numBuckets.

For example, suppose dataset.count is 200 million and we run:

val discretizer = new QuantileDiscretizer().setNumBuckets(10)
... set output and input columns ...
val dataWithBins = discretizer.fit(dataset).transform(dataset)

In this case, dataWithBins will have only two distinct bins versus the expected 10.

The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be fixed by changing line 113 as follows:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after: val requiredSamples = math.max(numBins * numBins, 10000.0)
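
The change from 10000 to 10000.0 most likely matters because of integer division. The following is a minimal sketch, not the actual Spark source. It assumes that line 114 (not quoted in this report) computes the sampling fraction by dividing requiredSamples by the row count. The names SampleFraction, fractionBefore and fractionAfter are hypothetical, and the code runs without Spark:

```scala
// Standalone illustration of the suspected cause; not the Spark implementation.
// Assumption: the sampling fraction is requiredSamples / rowCount, capped at 1.0.
object SampleFraction {

  // Before the fix: math.max(Int, Int) returns an Int, so dividing by a Long
  // row count is integer division and truncates toward zero.
  def fractionBefore(numBins: Int, rowCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    math.min(requiredSamples / rowCount, 1.0)
  }

  // After the fix: the literal 10000.0 makes requiredSamples a Double,
  // so the division keeps its fractional part.
  def fractionAfter(numBins: Int, rowCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    math.min(requiredSamples / rowCount, 1.0)
  }
}
```

Under that assumption, SampleFraction.fractionBefore(10, 200000000L) returns 0.0, so nothing would be sampled and the splits could not be computed properly, while SampleFraction.fractionAfter(10, 200000000L) returns a small positive fraction (about 0.00005).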

Issue Links

links to

[Github] Pull Request #11319 (oliverpierson)

[Github] Pull Request #11377 (oliverpierson)

[Github] Pull Request #11402 (oliverpierson)

People

Assignee: Oliver Pierson

Reporter: Oliver Pierson

Votes: 0

Watchers: 4

Dates

Created: 23/Feb/16 03:49

Updated: 07/Mar/16 09:48

Resolved: 25/Feb/16 13:27