In spark.mllib, GradientBoostedTrees has a runWithValidation method that accepts a separate validation set. We should add equivalent support to the spark.ml API.
This will require some thought about how the Pipelines API should handle a validation set, since Transformers and Estimators take only a single input DataFrame. The current plan is to include an extra column in the input DataFrame which indicates whether each row is for training, validation, etc.
Requirements:
A [P0] Support efficient validation during training
B [P1] Support early stopping based on validation metrics
C [P0] Ensure validation data are preprocessed identically to training data
D [P1] Support complex Pipelines with multiple models using validation data
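Requirement B is roughly what runWithValidation does today: stop adding trees once the validation error stops improving by more than a tolerance. A minimal sketch of that loop, in plain Python rather than Spark code (all names here, including `fit_with_early_stopping` and `tol`, are illustrative, not the real API):

```python
def fit_with_early_stopping(train_step, val_error, max_iters, tol):
    """Run boosting iterations, stopping when the validation error
    stops improving by at least `tol` between iterations.

    train_step(i) -> model with i+1 trees (stand-in for one boosting step)
    val_error(model) -> error of the model on the held-out validation rows
    Returns (model, iteration_stopped_at).
    """
    best_err = float("inf")
    model = None
    for i in range(max_iters):
        model = train_step(i)          # add one more tree
        err = val_error(model)
        if best_err - err < tol:       # improvement too small: stop early
            return model, i
        best_err = err
    return model, max_iters - 1


# Toy usage: validation error improves, then plateaus at iteration 2.
errors = [0.5, 0.4, 0.39, 0.389]
model, stopped_at = fit_with_early_stopping(
    train_step=lambda i: i,            # "model" is just the iteration index
    val_error=lambda m: errors[m],
    max_iters=4,
    tol=0.05,
)
```

This is only the stopping criterion; the actual spark.mllib implementation also has to evaluate validation error incrementally to keep requirement A (efficiency) satisfiable.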
Proposal: column with indicator for train vs validation
Include an extra column in the input DataFrame which indicates whether the row is for training or validation. Add a Param “validationFlagCol” used to specify the extra column name.
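The flag-column split could look like the following. Plain Python lists of dicts stand in for a Spark DataFrame here, and the names (`split_train_validation`, `isVal`) are illustrative, not the proposed API:

```python
def split_train_validation(rows, validation_flag_col):
    """Partition rows into (train, validation) by a boolean flag column.
    Stand-in for filtering a DataFrame on the validationFlagCol."""
    train = [r for r in rows if not r[validation_flag_col]]
    validation = [r for r in rows if r[validation_flag_col]]
    return train, validation


rows = [
    {"features": [0.1, 0.2], "label": 0.0, "isVal": False},
    {"features": [0.3, 0.4], "label": 1.0, "isVal": True},
    {"features": [0.5, 0.6], "label": 1.0, "isVal": False},
]
train, validation = split_train_validation(rows, "isVal")
```

In Spark itself this would presumably be two `filter` calls on the flag column inside the estimator's fit method, so upstream Transformers in a Pipeline process training and validation rows identically (requirement C).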
Under this proposal, A, B, and C are easy to satisfy. D is doable: each estimator in the Pipeline would need its validationFlagCol Param set to the same column.
Complication: Ideally we would prevent different estimators in the same Pipeline from using different validation sets. (Joseph: There is no obvious way to enforce this, IMO. We could address it later by, e.g., having Pipeline take its own validationFlagCol Param and pass that down to the sub-models in the Pipeline. Let’s not worry about this for now.)
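The "Pipeline passes the Param down" idea could be sketched as follows; the stub classes and attribute names are hypothetical stand-ins for spark.ml classes, not the real API:

```python
class EstimatorStub:
    """Stand-in for an Estimator carrying a validationFlagCol Param."""
    def __init__(self, name):
        self.name = name
        self.validation_flag_col = None


class PipelineStub:
    """Stand-in for a Pipeline that owns a single validationFlagCol Param
    and pushes it down to every stage, so all sub-models are forced to
    use the same validation split."""
    def __init__(self, stages, validation_flag_col):
        self.stages = stages
        self.validation_flag_col = validation_flag_col

    def fit(self, rows):
        # Propagate the Pipeline-level Param before fitting any stage.
        for stage in self.stages:
            stage.validation_flag_col = self.validation_flag_col
        # (actual per-stage fitting omitted in this sketch)


pipe = PipelineStub([EstimatorStub("gbt"), EstimatorStub("lr")], "isVal")
pipe.fit([])
```

The design point this illustrates: if only the Pipeline exposes the Param, users cannot set conflicting flag columns on individual stages, which resolves the complication above at the cost of a Pipeline API change.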