Spark / SPARK-14489

RegressionEvaluator returns NaN for ALS in Spark ml

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Environment: AWS EMR
    • Flags: Patch

      Description

      When a Spark ML pipeline containing an ALS estimator is evaluated with RegressionEvaluator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.

      The cause is in CrossValidator.scala, line 109: the K folds are generated randomly. For large, sparse datasets there is a significant probability that at least one user in the validation set is missing from the training set. The transform method then produces NaN predictions for those users, and the RegressionEvaluator metrics become NaN as well.
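The mechanism can be illustrated without Spark. The sketch below is a simplified stand-in for the random fold assignment, not the actual `MLUtils.kFold` implementation; the `Rating` case class, `ColdStartDemo` object, and its method names are hypothetical and exist only for this illustration. It counts the users that appear in a validation fold but in none of the training folds.

```scala
import scala.util.Random

// Hypothetical rating record; field names are illustrative only.
final case class Rating(user: Int, item: Int, rating: Double)

object ColdStartDemo {
  // Randomly assign each rating to one of `numFolds` folds
  // (an assumption: a simplified model of random K-fold splitting).
  def assignFolds(ratings: Seq[Rating], numFolds: Int, seed: Long): Seq[(Rating, Int)] = {
    val rng = new Random(seed)
    ratings.map(r => (r, rng.nextInt(numFolds)))
  }

  // Users present in the validation fold but absent from all training folds.
  def unseenUsers(folded: Seq[(Rating, Int)], validationFold: Int): Set[Int] = {
    val (validation, training) = folded.partition(_._2 == validationFold)
    validation.map(_._1.user).toSet -- training.map(_._1.user).toSet
  }

  def main(args: Array[String]): Unit = {
    // Extremely sparse synthetic data: 1000 users with one rating each,
    // so every validation user is necessarily unseen during training.
    val ratings = (0 until 1000).map(u => Rating(u, u % 50, 3.0))
    val folded = assignFolds(ratings, numFolds = 3, seed = 0L)
    (0 until 3).foreach { k =>
      println(s"fold $k: ${unseenUsers(folded, k).size} validation users unseen in training")
    }
  }
}
```

Any user returned by `unseenUsers` has no learned factors in a model trained on the other folds, which is the situation in which ALS produces a NaN prediction.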

      Suggested fix: remove the NaN values when computing the RMSE or other metrics (i.e., drop users or items in the validation set that are missing from the training set), and log a message when this happens.
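A minimal sketch of the suggested behaviour, written in plain Scala rather than against the Spark API: it computes RMSE over (prediction, label) pairs while skipping NaN predictions, and reports how many rows were skipped so the caller can log them. The object and method names are hypothetical, and this is not the change that was merged into Spark.

```scala
object NaNSafeRmse {
  // Naive RMSE: a single NaN prediction makes the whole result NaN.
  def rmse(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (p, l) => (p - l) * (p - l) }.sum / pairs.size)

  // Returns (RMSE over rows with a non-NaN prediction, number of rows dropped).
  // If every prediction is NaN, the RMSE is still NaN.
  def rmseIgnoringNaN(pairs: Seq[(Double, Double)]): (Double, Int) = {
    val (valid, dropped) = pairs.partition { case (pred, _) => !pred.isNaN }
    if (valid.isEmpty) (Double.NaN, dropped.size)
    else (rmse(valid), dropped.size)
  }

  def main(args: Array[String]): Unit = {
    // Second row models a user missing from the training set.
    val pairs = Seq((3.0, 4.0), (Double.NaN, 2.0), (5.0, 5.0))
    val (value, dropped) = rmseIgnoringNaN(pairs)
    println(s"naive rmse = ${rmse(pairs)}")
    println(s"rmse ignoring NaN = $value, dropped $dropped row(s)")
  }
}
```

In Spark itself, similar filtering could presumably be applied to the transformed DataFrame before evaluation, for example with `na.drop` restricted to the prediction column. Spark 2.2.0, the fix version listed above, also added a `coldStartStrategy` parameter to ALS whose "drop" setting removes rows with NaN predictions.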

      Issue SPARK-14153 seems to describe the same problem.

      Bar.scala
          val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
          splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
            val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
            val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
            // multi-model training
            logDebug(s"Train split $splitIndex with multiple sets of parameters.")
            val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
            trainingDataset.unpersist()
            var i = 0
            while (i < numModels) {
              // TODO: duplicate evaluator to take extra params from input
              val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
              logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
              metrics(i) += metric
              i += 1
            }
            validationDataset.unpersist()
          }
      

              People

              • Assignee: mlnick Nicholas Pentreath
              • Reporter: clemencb Boris Clémençon
              • Votes: 2
              • Watchers: 11

                  Time Tracking

                  • Original Estimate: 4h
                  • Remaining Estimate: 4h
                  • Time Spent: Not Specified