SPARK-32271

Add option for k-fold cross-validation to CrossValidator


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      What changes were proposed in this pull request?

      I have added a `method` parameter to `CrossValidator.scala` that lets the user choose between repeated random sub-sampling cross-validation (the current behavior) and k-fold cross-validation (optional new behavior). The default is repeated random sub-sampling cross-validation, so existing behavior is preserved.

      If k-fold cross-validation is chosen, the new behavior is as follows:

      1. Instead of splitting the input dataset into k training/validation pairs, I split it into k folds; in each round, one fold is held out for validation and the remaining k-1 folds are unioned together for training.
      2. Instead of caching each training and validation set k times, I cache each of the folds once.
      3. Instead of waiting for every model to finish training on fold n before moving on to fold n+1, new fold/model combinations will be trained as soon as resources are available.
      4. Instead of creating one `Future` per model for each fold in series, the `Future`s for every fold/parameter-grid pair are created up front and run in parallel.
      5. An `Int` index is added to the `Future`'s result (now `Future[(Int, Double)]` instead of `Future[Double]`) in order to keep track of which parameter-grid entry each `Future` belongs to.
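The fold-construction logic in step 1 can be sketched independently of Spark. This is a minimal Python illustration of the idea (not the PR's Scala code), assuming the dataset is represented as a simple list of rows:

```python
def kfold_splits(rows, k):
    """Partition rows into k folds, then yield (training, validation)
    pairs: for each fold i, fold i is the validation set and the union
    of the remaining k-1 folds is the training set."""
    folds = [rows[i::k] for i in range(k)]  # simple round-robin partition
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i
                    for row in fold]
        yield training, validation

# Every row appears in exactly one validation set across the k rounds,
# and in exactly k-1 training sets.
data = list(range(10))
splits = list(kfold_splits(data, k=5))
```

In Spark terms, each fold would be a cached `DataFrame` and the training set the union of the other cached folds, which is why each fold only needs to be cached once.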

      Why are the changes needed?

      These changes:
      1. allow the user to choose between repeated random sub-sampling cross-validation (the current behavior) and k-fold cross-validation.
      2. (if k-fold is chosen) require caching the dataset only once, instead of the k times needed by repeated random sub-sampling cross-validation today.
      3. (if k-fold is chosen) free resources to train new model/fold combinations as soon as a previous one finishes. Currently, a model can train only one fold at a time. With k-fold, `fit` can train multiple folds at once for the same model and, in a grid search, multiple model/fold combinations at once, without waiting for the slowest model to finish the first fold before moving on to the second.
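The scheduling idea in point 3 can be illustrated with Python's standard `concurrent.futures` (a hedged sketch of the technique, not Spark's implementation): all fold/parameter combinations are submitted up front, each future is tagged with its parameter-grid index, and results are aggregated per parameter setting as workers become free. `train_and_score`, `params`, and `k` below are stand-ins invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def train_and_score(param, fold_idx):
    # Stand-in for fitting a model on one fold and evaluating it;
    # returns a fake metric derived from its inputs.
    return param * 0.1 + fold_idx * 0.01

params = [1, 2, 3]   # stand-in for a parameter grid
k = 4                # number of folds
metrics = {p: [] for p in range(len(params))}

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit every (parameter, fold) combination at once; each one starts
    # as soon as a worker is free, rather than proceeding fold by fold.
    futures = {
        pool.submit(train_and_score, params[p], f): p
        for p in range(len(params))
        for f in range(k)
    }
    for fut in as_completed(futures):
        p = futures[fut]  # the index tag: which parameter-grid entry
        metrics[p].append(fut.result())

# Average the per-fold metrics for each parameter setting.
avg = {p: sum(v) / len(v) for p, v in metrics.items()}
```

The index tag plays the same role as the `Int` in the proposed `Future[(Int, Double)]`: because completions arrive out of order, each result must carry enough information to be attributed to the right parameter setting.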

      Does this PR introduce any user-facing change?

      Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the `method` parameter is not set, the behavior will be the same as it has always been.

      How was this patch tested?

      Unit tests will be added.

          People

            Assignee: Unassigned
            Reporter: Austin Jordan (adjordan)
