Spark / SPARK-6480

histogram() bucket function is wrong in some simple edge cases

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.2.2, 1.3.1, 1.4.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      (Credit to a customer report for this.) The following test currently fails:

          val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
          assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
      

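      It fails because histogram() returns 3, 1, 0. The problem ultimately traces to the 'fast' bucket function, which picks a bucket by dividing a value's offset from the minimum by the gap between the first and second bucket boundaries. Any rounding error in that gap is multiplied across the range, so the end of the final bucket falls just short of the max and the max values are not counted.

      This is a fairly plausible use case.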
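
      The rounding problem described above can be reproduced without Spark. The following is a minimal sketch in plain Scala, not the Spark source: the names BucketSketch, gapBased, ratioBased and counts are hypothetical, and it assumes the second boundary is computed as min + step in double arithmetic. The ratio-based variant is one possible tighter expression, shown only for comparison.

```scala
object BucketSketch {
  // Gap-based lookup (assumed shape of the 'fast' function): scale the offset
  // from min by the gap between the first and second bucket boundaries.
  def gapBased(min: Double, max: Double, count: Int)(e: Double): Option[Int] = {
    val step = (max - min) / count
    val second = min + step          // second boundary, already rounded
    val gap = second - min           // gap carries the rounding error
    val bucketNumber = (e - min) / gap
    if (bucketNumber < 0 || bucketNumber > count) None
    else Some(math.min(bucketNumber.toInt, count - 1))
  }

  // Ratio-based lookup (hypothetical alternative): compute the fraction of the
  // full range first, then scale by the bucket count.
  def ratioBased(min: Double, max: Double, count: Int)(e: Double): Option[Int] = {
    if (e.isNaN || e < min || e > max) None
    else Some(math.min((((e - min) / (max - min)) * count).toInt, count - 1))
  }

  // Tally how many values land in each bucket; values mapped to None are dropped.
  def counts(data: Seq[Double], bucketOf: Double => Option[Int], n: Int): Array[Long] = {
    val out = new Array[Long](n)
    data.foreach { e => bucketOf(e).foreach(b => out(b) += 1) }
    out
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 1, 1, 2, 3, 3).map(_.toDouble)
    println(counts(data, gapBased(1.0, 3.0, 3), 3).mkString(", "))   // 3, 1, 0
    println(counts(data, ratioBased(1.0, 3.0, 3), 3).mkString(", ")) // 3, 1, 2
  }
}
```

      In this sketch the gap comes out slightly smaller than 2/3, so (3.0 - 1.0) / gap is slightly greater than 3 and the max is rejected as out of range.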

      This can be tightened up easily with a slightly more precise expression. The change will also correct this existing test, which currently expects the wrong answer:

          val rdd = sc.parallelize(6 to 99)
          val (histogramBuckets, histogramResults) = rdd.histogram(9)
          val expectedHistogramResults =
            Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
      

      (Should be Array(11, 10, 10, 11, 10, 10, 11, 10, 11))
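
      The corrected expectation can be checked by hand with exact integer arithmetic. The sketch below is illustrative only (ExpectedHistogram and its methods are hypothetical names, not part of Spark). It assumes nine equal-width buckets over [6, 99] that are half-open on the right, with the max folded into the last bucket.

```scala
object ExpectedHistogram {
  // Exact bucket index for an integer e in [min, max] split into `count`
  // equal-width buckets. Integer division floors the non-negative quotient,
  // so no floating-point rounding is involved; the max goes in the last bucket.
  def bucketOf(e: Int, min: Int, max: Int, count: Int): Int =
    math.min((e - min) * count / (max - min), count - 1)

  // Count how many values of the range fall into each bucket.
  def counts(data: Range, count: Int): Array[Int] = {
    val out = new Array[Int](count)
    data.foreach(e => out(bucketOf(e, data.min, data.max, count)) += 1)
    out
  }

  def main(args: Array[String]): Unit =
    println(counts(6 to 99, 9).mkString(", ")) // 11, 10, 10, 11, 10, 10, 11, 10, 11
}
```

      For example, 37 sits exactly on the boundary 6 + 3 * (93 / 9) = 37, so it belongs to the fourth bucket rather than the third, which is where the old expectation differs.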

            People

            • Assignee: Sean Owen (srowen)
            • Reporter: Sean Owen (srowen)