Details
- Sub-task
- Status: Resolved
- Major
- Resolution: Duplicate
- 2.4.3
- None
- None
Description
The RDD predictionAndLabels in ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be persisted. When MulticlassMetrics uses predictionAndLabels to initialize its fields, at least five actions are executed on predictionAndLabels, so without persistence the RDD is recomputed for each of them.
override def evaluate(dataset: Dataset[_]): Double = {
  val schema = dataset.schema
  SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
  SchemaUtils.checkNumericType(schema, $(labelCol))

  // Needs to be persisted
  val predictionAndLabels =
    dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
      case Row(prediction: Double, label: Double) => (prediction, label)
    }
  // The initialization will use predictionAndLabels multiple times in different actions.
  val metrics = new MulticlassMetrics(predictionAndLabels)
  val metric = $(metricName) match {
    case "f1" => metrics.weightedFMeasure
    case "weightedPrecision" => metrics.weightedPrecision
    case "weightedRecall" => metrics.weightedRecall
    case "accuracy" => metrics.accuracy
  }
  metric
}
This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
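A minimal sketch of the kind of change the description asks for, written as a standalone program rather than a patch to the evaluator. The object name PersistSketch, the helper weightedF1, the sample data, the column names, and the MEMORY_AND_DISK storage level are all assumptions made for illustration; this is not necessarily how the linked issue was resolved.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.storage.StorageLevel

// Hypothetical standalone example; names and data are illustrative only.
object PersistSketch {

  def weightedF1(spark: SparkSession): Double = {
    import spark.implicits._

    // Made-up (prediction, label) pairs standing in for a real dataset.
    val dataset = Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0))
      .toDF("prediction", "label")

    // Persist the RDD so the actions run by MulticlassMetrics
    // reuse the cached partitions instead of recomputing the lineage.
    val predictionAndLabels = dataset
      .select(col("prediction"), col("label").cast(DoubleType))
      .rdd
      .map { case Row(prediction: Double, label: Double) => (prediction, label) }
      .persist(StorageLevel.MEMORY_AND_DISK) // storage level is an assumption

    try {
      val metrics = new MulticlassMetrics(predictionAndLabels)
      // Read the metric while the RDD is still cached.
      metrics.weightedFMeasure
    } finally {
      // Release the cached blocks once the metric has been computed.
      predictionAndLabels.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    try println(weightedF1(spark)) finally spark.stop()
  }
}
```

The metric is read inside the try block so that every action triggered by MulticlassMetrics runs before unpersist() is called.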
Issue Links
- duplicates SPARK-29818 Missing persist on RDD (Resolved)