Description
In certain circumstances, QuantileDiscretizer fails to calculate the correct splits and will instead split data into two bins regardless of the value specified in numBuckets.
For example, suppose dataset.count is 200 million and we run:
val discretizer = new QuantileDiscretizer().setNumBuckets(10)
... set output and input columns ...
val dataWithBins = discretizer.fit(dataset).transform(dataset)
In this case, dataWithBins will have only two distinct bins instead of the expected 10.
The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be fixed by changing line 113 as follows:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after: val requiredSamples = math.max(numBins * numBins, 10000.0)
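The likely reason the one-character change matters is integer division. The sketch below is a stand-alone illustration, not the actual Spark source. It assumes that line 114 (not shown above) divides requiredSamples by the row count to get a sampling fraction. The object and method names are hypothetical, and no Spark dependency is needed.

```scala
// Hypothetical stand-alone sketch of the arithmetic only; not the Spark source.
object SampleFractionSketch {
  // Mirrors the "before" line: both arguments to math.max are Int, so the result is an Int.
  def fractionBefore(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    // Assumed shape of line 114: Int / Long is integer division, so the
    // quotient is truncated to 0 before it is widened to Double.
    math.min(requiredSamples / totalSamples, 1.0)
  }

  // Mirrors the "after" line: 10000.0 makes requiredSamples a Double,
  // so the division below is floating-point.
  def fractionAfter(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    math.min(requiredSamples / totalSamples, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val total = 200000000L // 200 million rows, as in the example above
    println(fractionBefore(10, total)) // prints 0.0
    println(fractionAfter(10, total))  // prints 5.0E-5
  }
}
```

Under that assumption, any dataset with more rows than requiredSamples gets a sampling fraction of 0 before the fix, which is consistent with the splits being computed from an empty sample.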