  1. Spark
  2. SPARK-29818 Missing persist on RDD
  3. SPARK-29812

Missing persist on predictionAndLabels in MulticlassClassificationEvaluator


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      The RDD predictionAndLabels in ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be persisted. When MulticlassMetrics uses predictionAndLabels to initialize its fields, at least five actions are executed on predictionAndLabels. Without persist(), each of those actions recomputes the RDD from the dataset.

        override def evaluate(dataset: Dataset[_]): Double = {
          val schema = dataset.schema
          SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
          SchemaUtils.checkNumericType(schema, $(labelCol))
          // Needs to be persisted
          val predictionAndLabels =
            dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
              case Row(prediction: Double, label: Double) => (prediction, label)
            }
          // Initializing the metrics uses predictionAndLabels multiple times, in different actions.
          val metrics = new MulticlassMetrics(predictionAndLabels)
          val metric = $(metricName) match {
            case "f1" => metrics.weightedFMeasure
            case "weightedPrecision" => metrics.weightedPrecision
            case "weightedRecall" => metrics.weightedRecall
            case "accuracy" => metrics.accuracy
          }
          metric
        }
      

      This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
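
      One possible shape of the change is sketched below. It is not the patch that went into Spark (this sub-task was closed as a duplicate), only a standalone illustration. The object PersistedEvaluationSketch, the helper evaluateWithPersist, its parameters, and the MEMORY_AND_DISK storage level are assumptions made for the example. The RDD is persisted before MulticlassMetrics is built and unpersisted once the metric has been computed.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.storage.StorageLevel

// Hypothetical standalone sketch; not part of Spark.
object PersistedEvaluationSketch {

  // Computes one metric while keeping the (prediction, label) RDD cached,
  // so the actions triggered by MulticlassMetrics reuse the cached data.
  def evaluateWithPersist(
      dataset: DataFrame,
      predictionCol: String,
      labelCol: String,
      metricName: String): Double = {
    val predictionAndLabels = dataset
      .select(col(predictionCol).cast(DoubleType), col(labelCol).cast(DoubleType))
      .rdd
      .map { case Row(prediction: Double, label: Double) => (prediction, label) }
      .persist(StorageLevel.MEMORY_AND_DISK) // assumed storage level

    try {
      val metrics = new MulticlassMetrics(predictionAndLabels)
      // The metric is computed inside the try block, i.e. before unpersist() runs.
      metricName match {
        case "f1" => metrics.weightedFMeasure
        case "weightedPrecision" => metrics.weightedPrecision
        case "weightedRecall" => metrics.weightedRecall
        case "accuracy" => metrics.accuracy
      }
    } finally {
      // Release the cached blocks once the metric value is known.
      predictionAndLabels.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up dataset: three of the four predictions match the label.
    val df = Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0))
      .toDF("prediction", "label")

    println(evaluateWithPersist(df, "prediction", "label", "accuracy"))
    spark.stop()
  }
}
```

      The try/finally keeps the unpersist() call paired with persist() even if metric computation throws, which is the pairing CacheCheck looks for. Whether caching pays off in practice depends on the size of the dataset and how expensive it is to recompute.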

            People

              Assignee: Unassigned
              Reporter: IcySanwitch (spark_cachecheck)