[SPARK-6480] histogram() bucket function is wrong in some simple edge cases - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.0
Fix Version/s: 1.2.2, 1.3.1, 1.4.0
Component/s: Spark Core
Labels: None

Description

(Credit to a customer report for this one.) The following test currently fails:

    val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
    assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)

It fails because histogram returns 3, 1, 0. The problem ultimately traces to the 'fast' bucket function, which assigns buckets using multiples of the gap between the first and second bucket boundaries. Rounding errors compound across buckets, so the end of the final bucket falls short of the max and the max values are not counted.

This is a fairly plausible use case.

This can be tightened up easily with a slightly better expression. The change will also fix the following existing test, which currently expects the wrong answer:

    val rdd = sc.parallelize(6 to 99)
    val (histogramBuckets, histogramResults) = rdd.histogram(9)
    val expectedHistogramResults =
      Array(11, 10, 11, 10, 10, 11, 10, 10, 11)

(The expected result should be Array(11, 10, 10, 11, 10, 10, 11, 10, 11).)
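
As an illustration of the kind of expression that avoids the compounding error, here is a minimal sketch in plain Scala (no Spark). It assumes evenly spaced buckets and derives the bucket index from the value's position within the whole [min, max] span rather than from a multiple of the first gap. The names `EvenBuckets`, `bucketIndex`, and `counts` are hypothetical and chosen for this example; this is not necessarily the code in the linked pull request.

```scala
// Illustrative sketch only: names are hypothetical, not Spark's API.
object EvenBuckets {

  // Map a value to one of `count` evenly spaced buckets over [min, max].
  // The index comes from the value's fraction of the full span, so no
  // per-bucket increment is multiplied up, and e == max lands in the
  // last bucket (the final bucket is closed on the right).
  def bucketIndex(min: Double, max: Double, count: Int)(e: Double): Option[Int] = {
    if (e.isNaN || e < min || e > max) {
      None // outside the histogram range
    } else if (max == min) {
      Some(0) // degenerate span: everything falls in the first bucket
    } else {
      val raw = (((e - min) / (max - min)) * count).toInt
      // raw equals `count` only when e == max; clamp it into the last bucket.
      Some(math.min(raw, count - 1))
    }
  }

  // Count how many values fall in each bucket.
  // Assumes `values` is non-empty and `bucketCount` is positive.
  def counts(values: Seq[Double], bucketCount: Int): Array[Long] = {
    val min = values.min
    val max = values.max
    val result = new Array[Long](bucketCount)
    values.foreach { v =>
      bucketIndex(min, max, bucketCount)(v).foreach(i => result(i) += 1)
    }
    result
  }
}
```

Under these assumptions, `EvenBuckets.counts(Seq(1, 1, 1, 2, 3, 3).map(_.toDouble), 3)` gives 3, 1, 2, and `EvenBuckets.counts((6 to 99).map(_.toDouble), 9)` gives 11, 10, 10, 11, 10, 10, 11, 10, 11, matching the two expected results above.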

Issue Links

links to: [GitHub] Pull Request #5148 (srowen)

People

Assignee: Sean R. Owen
Reporter: Sean R. Owen
Votes: 1
Watchers: 5

Dates

Created: 23/Mar/15 23:17
Updated: 26/Mar/15 15:01
Resolved: 26/Mar/15 15:01