[SPARK-14489] RegressionEvaluator returns NaN for ALS in Spark ml - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 2.2.0
Component/s: ML
Labels: patch
Environment: AWS EMR
Target Version/s: 2.2.0
Flags: Patch

Description

When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.

The cause is in CrossValidator.scala, line 109: the k folds are generated randomly. For large, sparse datasets, there is a significant probability that at least one user in the validation set is missing from the training set. The transform method then produces NaN predictions for those rows, and the RegressionEvaluator metrics become NaN as well.
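To illustrate how a single cold-start row is enough to make the metric NaN, here is a minimal plain-Scala sketch that does not use Spark. The object name `NaNRmseDemo`, the user IDs, and the ratings are all hypothetical; the unknown-user rule only mimics the behaviour described above and is not the ALS implementation.

```scala
object NaNRmseDemo {
  // Root-mean-square error over (prediction, label) pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    val sumSq = pairs.map { case (p, l) => (p - l) * (p - l) }.sum
    math.sqrt(sumSq / pairs.size)
  }

  // Mimic the cold-start case: a user absent from training gets a NaN prediction.
  def score(trainedUsers: Set[Int],
            validation: Seq[(Int, Double, Double)]): Seq[(Double, Double)] =
    validation.map { case (user, label, pred) =>
      (if (trainedUsers.contains(user)) pred else Double.NaN, label)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical training fold: users the model has factors for.
    val trainedUsers = Set(1, 2, 3)
    // Hypothetical validation fold: (user, label, prediction if the user is known).
    // User 4 never appears in the training fold.
    val validation = Seq((1, 4.0, 3.5), (2, 2.0, 2.5), (4, 5.0, 4.0))

    val scored = score(trainedUsers, validation)

    // One NaN term makes the whole sum, and therefore the metric, NaN.
    println(rmse(scored))                         // prints NaN
    // Dropping the NaN rows gives a finite metric over the remaining rows.
    println(rmse(scored.filterNot(_._1.isNaN)))   // prints 0.5
  }
}
```

With only three validation rows the effect is obvious, but the same arithmetic applies to a fold with millions of rows: one missing user is enough.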

Suggested fix: remove the NaN values when computing the RMSE or the other metrics (i.e., drop the validation rows whose user or item is missing from the training set), and log a message when this happens.
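As a user-side workaround along the lines of this suggestion, the NaN rows can be dropped before the evaluator is called. The sketch below assumes the Spark 2.x DataFrame API (SparkSession) running locally; the object name `DropNaNBeforeEval`, the helper `evaluateIgnoringNaN`, the column names, and the toy ratings are hypothetical. It is not the patch that resolved this issue.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropNaNBeforeEval {
  // Hypothetical helper: drop rows whose prediction is null/NaN, report how
  // many were dropped, then evaluate on the remaining rows.
  def evaluateIgnoringNaN(predictions: DataFrame, evaluator: RegressionEvaluator): Double = {
    val usable = predictions.na.drop(Seq(evaluator.getPredictionCol))
    val dropped = predictions.count() - usable.count()
    if (dropped > 0) println(s"Dropped $dropped validation rows with NaN predictions")
    evaluator.evaluate(usable)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("als-nan-sketch").getOrCreate()
    import spark.implicits._

    // Toy data (assumption): user 4 appears only in the validation split.
    val training = Seq((1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0), (3, 11, 2.0))
      .toDF("user", "item", "rating")
    val validation = Seq((2, 11, 3.0), (4, 10, 4.0)).toDF("user", "item", "rating")

    val als = new ALS()
      .setUserCol("user").setItemCol("item").setRatingCol("rating")
      .setRank(2).setMaxIter(5).setSeed(0L)
    val model = als.fit(training)

    val evaluator = new RegressionEvaluator()
      .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")

    // The row for user 4 has no user factors, so its prediction is NaN by default.
    val rmse = evaluateIgnoringNaN(model.transform(validation), evaluator)
    println(s"RMSE on rows with known users and items: $rmse")

    spark.stop()
  }
}
```

From Spark 2.2.0 onward, ALS also exposes the coldStartStrategy parameter referenced in SPARK-19345; calling `setColdStartStrategy("drop")` on the estimator removes such rows inside transform, which also works when the model is evaluated through CrossValidator.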

Issue ~~SPARK-14153~~ seems to be the same problem.

Bar.scala

    val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
      val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
      val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
      // multi-model training
      logDebug(s"Train split $splitIndex with multiple sets of parameters.")
      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
      trainingDataset.unpersist()
      var i = 0
      while (i < numModels) {
        // TODO: duplicate evaluator to take extra params from input
        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
        metrics(i) += metric
        i += 1
      }
      validationDataset.unpersist()
    }

Issue Links

is depended upon by

SPARK-19345 Add doc for "coldStartStrategy" usage in ALS

Resolved

is duplicated by

SPARK-14153 My dataset does not provide proper predictions in ALS

Resolved

is related to

SPARK-13857 Feature parity for ALS ML with MLLIB

Closed

SPARK-14409 Investigate adding a RankingEvaluator to ML

Resolved

relates to

SPARK-19346 Add further cold-start strategies for ALS prediction

Resolved

links to

[Github] Pull Request #12577 (MLnick)

[Github] Pull Request #12896 (MLnick)

People

Assignee: Nicholas Pentreath

Reporter: Boris Clémençon

Votes: 2

Watchers: 10

Dates

Created: 08/Apr/16 11:26

Updated: 28/Feb/17 14:18

Resolved: 28/Feb/17 14:18

Time Tracking: Not Specified