SPARK-32271

Add option for k-fold cross-validation to CrossValidator


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      What changes were proposed in this pull request?

      I have added a `method` parameter to `CrossValidator.scala` that lets the user choose between repeated random sub-sampling cross-validation (the current behavior) and k-fold cross-validation (optional new behavior). The default is repeated random sub-sampling cross-validation, so existing behavior is preserved.

      If k-fold cross-validation is chosen, the new behavior is as follows:

      1. Instead of splitting the input dataset into k training/validation pairs, I split it into k folds; in each round, one fold is held out for validation and the remaining k-1 folds are unioned together for training.
      2. Instead of caching each training and validation set k times, I cache each of the folds once.
      3. Instead of waiting for every model to finish training on fold n before moving on to fold n+1, new fold/model combinations will be trained as soon as resources are available.
      4. Instead of creating one `Future` per model for each fold in series, the `Future`s for every fold/parameter-grid pair are created up front and run in parallel.
      5. An `Int` index is added to the `Future`'s result (now `Future[(Int, Double)]` instead of `Future[Double]`) in order to keep track of which parameter-grid entry each `Future` belongs to.
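The fold-construction logic in step 1 can be sketched independently of Spark. This is a minimal Python illustration of the idea (not the PR's Scala code), assuming the dataset is represented as a simple list of rows:

```python
def kfold_splits(rows, k):
    """Partition rows into k folds, then yield (training, validation)
    pairs: for each fold i, fold i is the validation set and the union
    of the remaining k-1 folds is the training set."""
    folds = [rows[i::k] for i in range(k)]  # simple round-robin partition
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i
                    for row in fold]
        yield training, validation

# Every row appears in exactly one validation set across the k rounds,
# and in exactly k-1 training sets.
data = list(range(10))
splits = list(kfold_splits(data, k=5))
```

In Spark terms, each fold would be a cached `DataFrame` and the training set the union of the other cached folds, which is why each fold only needs to be cached once.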

      Why are the changes needed?

      These changes:
      1. allow the user to choose between repeated random sub-sampling cross-validation (the current behavior) and k-fold cross-validation.
      2. (if k-fold is chosen) require caching the dataset only once, instead of the k times needed by repeated random sub-sampling cross-validation today.
      3. (if k-fold is chosen) free resources to train new model/fold combinations as soon as a previous one finishes. Currently, a model can train only one fold at a time. With k-fold, `fit` can train multiple folds at once for the same model and, in a grid search, multiple model/fold combinations at once, without waiting for the slowest model to finish the first fold before moving on to the second.
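The scheduling idea in point 3 can be illustrated with Python's standard `concurrent.futures` (a hedged sketch of the technique, not Spark's implementation): all fold/parameter combinations are submitted up front, each future is tagged with its parameter-grid index, and results are aggregated per parameter setting as workers become free. `train_and_score`, `params`, and `k` below are stand-ins invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def train_and_score(param, fold_idx):
    # Stand-in for fitting a model on one fold and evaluating it;
    # returns a fake metric derived from its inputs.
    return param * 0.1 + fold_idx * 0.01

params = [1, 2, 3]   # stand-in for a parameter grid
k = 4                # number of folds
metrics = {p: [] for p in range(len(params))}

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit every (parameter, fold) combination at once; each one starts
    # as soon as a worker is free, rather than proceeding fold by fold.
    futures = {
        pool.submit(train_and_score, params[p], f): p
        for p in range(len(params))
        for f in range(k)
    }
    for fut in as_completed(futures):
        p = futures[fut]  # the index tag: which parameter-grid entry
        metrics[p].append(fut.result())

# Average the per-fold metrics for each parameter setting.
avg = {p: sum(v) / len(v) for p, v in metrics.items()}
```

The index tag plays the same role as the `Int` in the proposed `Future[(Int, Double)]`: because completions arrive out of order, each result must carry enough information to be attributed to the right parameter setting.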

      Does this PR introduce any user-facing change?

      Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the `method` parameter is not set, the behavior will be the same as it has always been.

      How was this patch tested?

      Unit tests will be added.

          People

            Assignee: Unassigned
            Reporter: Austin Jordan (adjordan)
