Description
For consistency and code reuse, QuantileDiscretizer should use approxQuantile to find splits in the data rather than implement its own method.
Additionally, making this change should remedy a bug where QuantileDiscretizer fails to calculate the correct splits in certain circumstances, resulting in an incorrect number of buckets/bins.
E.g.
val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
discretizer.fit(df).getSplits
gives:
Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
which corresponds to 6 buckets (not 5).
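The bucket arithmetic can be checked without a cluster. The following is a minimal sketch in plain Scala, not the Spark implementation: `SplitSketch`, `splits`, and `bucketCount` are hypothetical names, and an exact nearest-rank quantile over a local collection stands in for approxQuantile. It assumes the intended behaviour is numBuckets - 1 interior cut points padded with -Infinity and Infinity.

```scala
// Hypothetical helper, not part of Spark. An exact nearest-rank quantile on a
// local collection stands in for DataFrame.stat.approxQuantile.
object SplitSketch {

  /** Splits for `numBuckets` buckets: numBuckets - 1 interior cut points
    * (de-duplicated), padded with -Infinity and +Infinity. */
  def splits(data: Seq[Double], numBuckets: Int): Array[Double] = {
    require(data.nonEmpty, "data must not be empty")
    require(numBuckets >= 2, "need at least two buckets")
    val sorted = data.sorted.toIndexedSeq
    val interior = (1 until numBuckets).map { i =>
      // Nearest-rank quantile for probability i / numBuckets, using integer
      // arithmetic for ceil(i * n / numBuckets) to avoid rounding surprises.
      val rank = (i * sorted.length + numBuckets - 1) / numBuckets
      sorted(math.max(rank, 1) - 1)
    }
    Array(Double.NegativeInfinity) ++ interior.distinct ++ Array(Double.PositiveInfinity)
  }

  /** A splits array of length n defines n - 1 buckets. */
  def bucketCount(splits: Array[Double]): Int = splits.length - 1
}
```

With the data from the example above, `SplitSketch.splits((1 to 10).map(_.toDouble), 5)` yields six split points and therefore five buckets, whereas the seven-element array reported above yields six.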
Issue Links
- relates to: SPARK-10785 Scale QuantileDiscretizer using distributed binning (Closed)
- supersedes: SPARK-10785 Scale QuantileDiscretizer using distributed binning (Closed)