Spark / SPARK-11581

Example mllib code in documentation incorrectly computes MSE


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.4.1, 1.5.1, 1.6.0
    • Fix Version/s: 1.4.2, 1.5.3, 1.6.0
    • Component/s: Documentation

    Description

      The example Java code at the bottom of the mllib-decision-tree web page shows how to compute the mean squared error (MSE) on the test data, but it contains a bug: it divides the summed squared error by data.count(), the size of the full dataset, when it should divide by testData.count(), the size of the test set.

      http://spark.apache.org/docs/latest/mllib-decision-tree.html

      Double testMSE =
        predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
          @Override
          public Double call(Tuple2<Double, Double> pl) {
            Double diff = pl._1() - pl._2();
            return diff * diff;
          }
        }).reduce(new Function2<Double, Double, Double>() {
          @Override
          public Double call(Double a, Double b) {
            return a + b;
          }
        }) / data.count();
      System.out.println("Test Mean Squared Error: " + testMSE);
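
      To illustrate why the denominator matters, here is a minimal sketch in plain Java with no Spark dependency. The class name MseExample, the sample (prediction, label) pairs, and the full-dataset size of 10 are all hypothetical and chosen only for illustration; each double[] pair stands in for a Tuple2<Double, Double> from the snippet above.

```java
import java.util.Arrays;
import java.util.List;

public class MseExample {
    // Sum of squared differences over {prediction, label} pairs.
    // Each double[] is a hypothetical stand-in for Tuple2<Double, Double>.
    static double sumSquaredError(List<double[]> predictionAndLabel) {
        double sum = 0.0;
        for (double[] pl : predictionAndLabel) {
            double diff = pl[0] - pl[1];
            sum += diff * diff;
        }
        return sum;
    }

    // MSE divides by the number of pairs that were actually evaluated,
    // which corresponds to testData.count() in the report above.
    static double meanSquaredError(List<double[]> predictionAndLabel) {
        return sumSquaredError(predictionAndLabel) / predictionAndLabel.size();
    }

    public static void main(String[] args) {
        // Hypothetical test-set predictions and labels (4 rows).
        List<double[]> testPairs = Arrays.asList(
            new double[] {1.0, 1.0},
            new double[] {0.0, 1.0},
            new double[] {2.0, 1.5},
            new double[] {3.0, 3.0});

        // Hypothetical size of the full dataset before the train/test split.
        long fullDataCount = 10;

        double correct = meanSquaredError(testPairs);
        double buggy = sumSquaredError(testPairs) / fullDataCount;

        System.out.println("MSE dividing by test count: " + correct);
        System.out.println("Value dividing by full data count: " + buggy);
    }
}
```

      Because the test set is a subset of the full dataset, dividing by the full count makes the reported error smaller than the true test MSE whenever the summed error is nonzero.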


          People

            bharat1 M Bharat lal
            bwebb Brian Webb
            Votes: 0
            Watchers: 3
