Spark / SPARK-6480

histogram() bucket function is wrong in some simple edge cases


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.2.2, 1.3.1, 1.4.0
    • Component/s: Spark Core
    • Labels: None

    Description

      (Credit to a customer report.) This test currently fails:

          val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
          assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
      

      It fails because histogram() returns 3, 1, 0. The problem ultimately traces to the 'fast' bucket function, which places values into buckets as multiples of the gap between the first and second bucket boundaries. The rounding error in that gap multiplies across the buckets, so the upper end of the final bucket falls just short of the max and the max values are not counted.

      This is a fairly plausible use case.
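The effect can be reproduced without Spark. The following is a minimal sketch in plain Scala, not the actual Spark source: the names `gapBasedBucket`, `rangeBasedBucket` and `counts` are hypothetical, and the range-based expression is only one assumed way of tightening the calculation, not necessarily the expression used in the committed fix.

```scala
// Sketch only: plain Scala, no Spark dependency. All names here are hypothetical.

// Gap-based approach: the bucket index is the offset from min divided by the
// gap between the first two bucket boundaries.
def gapBasedBucket(min: Double, gap: Double, count: Int): Double => Option[Int] = e => {
  val bucketNumber = (e - min) / gap
  if (e.isNaN || bucketNumber > count || bucketNumber < 0) None
  else Some(math.min(bucketNumber.toInt, count - 1))
}

// Assumed alternative: scale by the full range so that max maps exactly to count.
def rangeBasedBucket(min: Double, max: Double, count: Int): Double => Option[Int] = e => {
  if (e.isNaN || e < min || e > max) None
  else Some(math.min((((e - min) / (max - min)) * count).toInt, count - 1))
}

// Count how many values land in each bucket; values mapped to None are dropped.
def counts(data: Seq[Double], count: Int, bucketOf: Double => Option[Int]): Seq[Int] = {
  val hits = data.flatMap(e => bucketOf(e).toList)
  (0 until count).map(i => hits.count(_ == i))
}

val data = Seq(1, 1, 1, 2, 3, 3).map(_.toDouble)
val min = data.min
val max = data.max
val count = 3

// The second boundary is rounded to a Double, so the recovered gap is
// slightly smaller than (max - min) / count.
val secondBoundary = min + (max - min) / count
val gap = secondBoundary - min

val gapBased = counts(data, count, gapBasedBucket(min, gap, count))
val rangeBased = counts(data, count, rangeBasedBucket(min, max, count))

println(gapBased.mkString(", "))   // 3, 1, 0
println(rangeBased.mkString(", ")) // 3, 1, 2
```

With the gap-based function, (3 - 1) / gap comes out slightly above 3, so both occurrences of the max are rejected as out of range. Scaling by (max - min) avoids recovering the gap from rounded boundaries.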

      This can be tightened up easily with a slightly better expression. The change will also correct this existing test, which currently expects the wrong answer:

          val rdd = sc.parallelize(6 to 99)
          val (histogramBuckets, histogramResults) = rdd.histogram(9)
          val expectedHistogramResults =
            Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
      

      (Should be Array(11, 10, 10, 11, 10, 10, 11, 10, 11))
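The corrected expectation can be checked independently of floating-point behaviour. This is a small sketch, assuming evenly spaced buckets that are closed on the left and open on the right, with the last bucket also including the max; it uses exact integer arithmetic, and the names are hypothetical.

```scala
// Sketch only: exact integer check of the expected counts for 6 to 99 in 9 buckets.
val values = 6 to 99
val lo = values.min
val hi = values.max
val bucketCount = 9

// floor((v - lo) * bucketCount / (hi - lo)), with the max clamped into the last bucket.
val indices = values.map(v => math.min((v - lo) * bucketCount / (hi - lo), bucketCount - 1))
val exactCounts = (0 until bucketCount).map(i => indices.count(_ == i))

println(exactCounts.mkString(", ")) // 11, 10, 10, 11, 10, 10, 11, 10, 11
```

The bucket boundaries at 37 and 68 are whole numbers, so those values belong to the bucket that starts there, which is where the previously expected array differs.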

          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Sean R. Owen (srowen)
            Votes: 1
            Watchers: 5