SPARK-18715

Fix wrong AIC calculation in Binomial GLM


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels: Important

    Description

      The AIC calculation in the Binomial GLM appears to be wrong when weights are present: the result differs from what R produces for the same data.

      The current implementation is:

            -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
              weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
            }.sum()
      

      I suggest changing this to:

            -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
              val wt = math.round(weight).toInt
              if (wt == 0) {
                0.0
              } else {
                dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
              }
            }.sum()
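To see the difference concretely, here is a minimal, self-contained sketch in pure Scala (no Breeze; the values y = 0.5, weight = 2, mu = 0.3 are illustrative, not taken from the example below). The current code computes weight * log P(Bernoulli(mu) = round(y)), while the proposed code computes log P(Binomial(round(weight), mu) = round(y * weight)):

```scala
object AicDiff {
  // log of the Binomial(n, p) pmf at k: log C(n, k) + k*log(p) + (n-k)*log(1-p)
  def logBinomialPmf(n: Int, k: Int, p: Double): Double = {
    def logFactorial(m: Int): Double = (1 to m).map(i => math.log(i.toDouble)).sum
    logFactorial(n) - logFactorial(k) - logFactorial(n - k) +
      k * math.log(p) + (n - k) * math.log(1.0 - p)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative values only (not from the example dataset below):
    val y = 0.5
    val weight = 2.0
    val mu = 0.3

    // Current implementation: weight * log Bernoulli(mu) pmf at round(y)
    val current = weight * logBinomialPmf(1, math.round(y).toInt, mu)

    // Proposed fix: log Binomial(round(weight), mu) pmf at round(y * weight)
    val wt = math.round(weight).toInt
    val proposed = logBinomialPmf(wt, math.round(y * weight).toInt, mu)

    println(f"current  = $current%.6f")  // 2 * log(0.3)       = -2.407946
    println(f"proposed = $proposed%.6f") // log(2 * 0.3 * 0.7) = -0.867501
  }
}
```

The two values clearly differ: multiplying the Bernoulli log-likelihood by the weight is not the same as the log-likelihood of a Binomial with the weight as the number of trials.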
      


      The following is an example to illustrate the problem.

      import org.apache.spark.ml.feature.LabeledPoint
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.ml.regression.GeneralizedLinearRegression
      import org.apache.spark.sql.functions.col
      import spark.implicits._

      val dataset = Seq(
            LabeledPoint(0.0, Vectors.dense(18, 1.0)),
            LabeledPoint(0.5, Vectors.dense(12, 0.0)),
            LabeledPoint(1.0, Vectors.dense(15, 0.0)),
            LabeledPoint(0.0, Vectors.dense(13, 2.0)),
            LabeledPoint(0.0, Vectors.dense(15, 1.0)),
            LabeledPoint(0.5, Vectors.dense(16, 1.0))
          ).toDF().withColumn("weight", col("label") + 1.0)
      val glr = new GeneralizedLinearRegression()
          .setFamily("binomial")
          .setWeightCol("weight")
          .setRegParam(0)
      val model = glr.fit(dataset)
      model.summary.aic
      

      This calculation gives AIC = 14.189026847171382. To verify whether this is correct, I ran the same analysis in R and got AIC = 11.66092 and -2 * logLik = 5.660918.
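For intuition about why the proposed formula matches R: under the fix, each row (y, weight) is treated as round(y * weight) successes out of round(weight) trials, which is how R's glm interprets a fractional response with prior weights (note that non-integer weights get rounded under this interpretation). A small sketch for the example dataset's (label, weight) pairs:

```scala
object WeightedRows {
  def main(args: Array[String]): Unit = {
    // (label, weight) pairs from the example dataset (weight = label + 1.0)
    val rows = Seq((0.0, 1.0), (0.5, 1.5), (1.0, 2.0), (0.0, 1.0), (0.0, 1.0), (0.5, 1.5))
    rows.foreach { case (y, w) =>
      val trials = math.round(w).toInt        // weight as number of trials
      val successes = math.round(y * w).toInt // y * weight as number of successes
      println(s"y = $y, weight = $w -> $successes successes out of $trials trials")
    }
  }
}
```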

      da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
      0,18,1,1
      0.5,12,0,1.5
      1,15,0,2
      0,13,2,1
      0,15,1,1
      0.5,16,1,1.5
      da <- as.data.frame(da)
      f <- glm(y ~ x1 + x2, data = da, family = binomial(), weights = w)
      AIC(f)
      -2 * logLik(f)
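As a consistency check on R's numbers (assuming the usual definition AIC = -2 * logLik + 2k, with k = 3 estimated coefficients here: intercept, x1, x2):

```scala
object AicCheck {
  def main(args: Array[String]): Unit = {
    val minusTwoLogLik = 5.660918 // -2 * logLik(f) as reported by R
    val k = 3                     // intercept, x1, x2
    val aic = minusTwoLogLik + 2.0 * k
    println(aic) // 11.660918, matching R's AIC(f) = 11.66092 up to rounding
  }
}
```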
      

      Finally, I check that the proposed change is correct. The following computes -2 * logLik manually and gets 5.6609177228379055, the same value as in R.

      import breeze.stats.{distributions => dist}
      import org.apache.spark.sql.Row

      val predictions = model.transform(dataset)
      -2.0 * predictions.select("label", "prediction", "weight").rdd.map {
        case Row(y: Double, mu: Double, weight: Double) =>
          val wt = math.round(weight).toInt
          if (wt == 0) {
            0.0
          } else {
            dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
          }
      }.sum()


People

    Assignee: actuaryzhang (Wayne Zhang)
    Reporter: actuaryzhang (Wayne Zhang)
    Votes: 0
    Watchers: 2


Time Tracking

    Original Estimate: 120h
    Remaining Estimate: 120h
    Time Spent: Not Specified