Details
- Type: Improvement
- Status: In Progress
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.1.0
- Fix Version/s: None
- Component/s: None
Description
What changes were proposed in this pull request?
I have added a `method` parameter to `CrossValidator.scala` to allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and k-fold cross-validation (optional new behavior). The default method is random sub-sampling cross-validation.
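As a rough illustration of the shape of such a parameter, here is a minimal sketch of how a `method` param could be declared; the trait name `HasCVMethod` and the value names `"subsampling"`/`"kfold"` are assumptions for illustration, not the actual patch:

```scala
import org.apache.spark.ml.param.{Param, Params, ParamValidators}

// Sketch only: a shared-param-style trait holding a proposed `method` parameter.
trait HasCVMethod extends Params {
  // The allowed value names are assumptions; the real patch may use different ones.
  val method: Param[String] = new Param[String](this, "method",
    "cross-validation method: 'subsampling' (current default) or 'kfold'",
    ParamValidators.inArray(Array("subsampling", "kfold")))

  setDefault(method -> "subsampling")

  def getMethod: String = $(method)
}
```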
If k-fold cross-validation is chosen, the new behavior is as follows:
- Instead of splitting the input dataset into k (training, validation) pairs, I split it into k folds; in each round, one fold is held out for validation and the remaining k-1 folds are unioned together for training.
- Instead of caching each training and validation set k times, I cache each of the folds once.
- Instead of waiting for every model to finish training on fold n before moving on to fold n+1, new fold/model combinations will be trained as soon as resources are available.
- Instead of creating one `Future` per model for each fold in series, the `Future`s for every fold and parameter-map pair are created up front and run in parallel.
- The `Future` result carries an extra `Int` (i.e. `Future[(Int, Double)]` instead of `Future[Double]`) to keep track of which parameter map each metric belongs to; a sketch of this shape follows this list.
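A rough sketch of the fold handling and keyed `Future`s described above; the function name `kFoldMetrics` and the `evaluate` callback (which stands in for fitting the i-th parameter map on the training data and scoring it on the validation data) are hypothetical, not the actual patch:

```scala
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.DataFrame

// Sketch only: split once into k folds, cache each fold a single time, and launch
// a Future for every (fold, parameter-map) pair. The Int in Future[(Int, Double)]
// records which parameter map produced the metric.
def kFoldMetrics(
    dataset: DataFrame,
    numFolds: Int,
    numParamMaps: Int,
    evaluate: (DataFrame, DataFrame, Int) => Double)(
    implicit ec: ExecutionContext): Seq[Future[(Int, Double)]] = {

  // One split, each fold cached once (instead of caching k training/validation pairs).
  val folds = dataset.randomSplit(Array.fill(numFolds)(1.0 / numFolds), seed = 42L)
  folds.foreach(_.cache())

  // All (fold, parameter-map) combinations are submitted immediately; the scheduler
  // runs them as resources become available.
  for {
    foldIndex  <- 0 until numFolds
    paramIndex <- 0 until numParamMaps
  } yield Future {
    val validation = folds(foldIndex)
    val training = folds.indices
      .filter(_ != foldIndex)
      .map(i => folds(i))
      .reduce(_ union _)
    (paramIndex, evaluate(training, validation, paramIndex))
  }
}
```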
Why are the changes needed?
These changes:
1. allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and k-fold cross-validation (optional new behavior).
2. (if k-fold is chosen) only require caching the entire dataset once, instead of k times as repeated random sub-sampling cross-validation does now.
3. (if k-fold is chosen) free resources to train the next model/fold combination as soon as the previous one finishes. Currently, a model can only train one fold at a time. With k-fold, `fit` can train multiple folds of the same model at once and, in a grid search, multiple model/fold combinations at once, without waiting for the slowest model to finish the first fold before moving on to the second.
Does this PR introduce any user-facing change?
Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the `method` parameter is not set, the behavior will be the same as it has always been.
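A hypothetical usage example, assuming the proposed setter and a `"kfold"` value name (both are assumptions here); omitting `setMethod` keeps the existing behavior:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setMethod("kfold") // proposed opt-in; the value name is an assumption
```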
How was this patch tested?
Unit tests will be added.