Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.3.0
- None
Description
(Credit to a customer report.) The following test currently fails:
val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
It fails because histogram returns 3, 1, 0 instead. The problem ultimately traces to the 'fast' bucket function, which assigns buckets using a multiple of the gap between the first and second bucket boundaries. Rounding error in that gap is multiplied across the buckets, so the end of the final bucket fails to include the max.
This is a fairly plausible use case.
The bucket function can be tightened up easily with a slightly better expression. That change will also fix this existing test, which currently expects the wrong answer:
val rdd = sc.parallelize(6 to 99)
val (histogramBuckets, histogramResults) = rdd.histogram(9)
val expectedHistogramResults = Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
(The expected result should be Array(11, 10, 10, 11, 10, 10, 11, 10, 11).)
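To make the failure mode concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not Spark's source: gapBucket, ratioBucket, and counts are hypothetical names, and ratioBucket is only one plausible form of the "slightly better expression". It assumes that the gap is taken as the difference between the first two computed bucket boundaries.

```scala
// Illustrative sketch only; all names here are hypothetical, not Spark's API.
object HistogramBucketSketch {
  // Gap-based: bucket index is (e - min) divided by the gap between the
  // first two bucket boundaries. Rounding error in the gap gets scaled up.
  def gapBucket(min: Double, gap: Double, count: Int)(e: Double): Option[Int] = {
    val b = (e - min) / gap
    if (b.isNaN || b < 0 || b > count) None
    else Some(math.min(b.toInt, count - 1))
  }

  // Ratio-based (assumed fix): position of e within [min, max], scaled by count.
  def ratioBucket(min: Double, max: Double, count: Int)(e: Double): Option[Int] = {
    if (e.isNaN || e < min || e > max) None
    else Some(math.min((((e - min) / (max - min)) * count).toInt, count - 1))
  }

  // Count how many elements land in each bucket; unbucketed elements are dropped.
  def counts(data: Seq[Double], count: Int, bucket: Double => Option[Int]): Array[Long] = {
    val result = new Array[Long](count)
    data.flatMap(x => bucket(x)).foreach(i => result(i) += 1)
    result
  }

  val data: Seq[Double] = Seq(1.0, 1.0, 1.0, 2.0, 3.0, 3.0)
  val lo = 1.0
  val hi = 3.0
  val n = 3

  // Second boundary as computed in floating point; slightly below 5/3.
  val secondEdge: Double = lo + (hi - lo) / n

  // (3 - 1) / gap comes out slightly above 3, so both 3.0 values are dropped.
  val gapCounts: Array[Long] = counts(data, n, gapBucket(lo, secondEdge - lo, n))     // 3, 1, 0
  val ratioCounts: Array[Long] = counts(data, n, ratioBucket(lo, hi, n))              // 3, 1, 2

  // The 6 to 99 case from the second test, using the ratio-based function.
  val rangeCounts: Array[Long] =
    counts((6 to 99).map(_.toDouble), 9, ratioBucket(6.0, 99.0, 9))
    // 11, 10, 10, 11, 10, 10, 11, 10, 11
}
```

The ratio-based form divides by max - min directly, so the max always maps to exactly count and is clamped into the last bucket, rather than depending on a boundary difference that has already been rounded.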