[SPARK-6683] Handling feature scaling properly for GLMs - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.3.0
Fix Version/s: 1.5.0
Component/s: MLlib
Labels: None
Target Version/s: 1.5.0

Description

GeneralizedLinearAlgorithm can scale features. This has two effects:

- It improves optimization behavior (in practice, essentially always).
- It changes the optimal solution (often for the better, since it standardizes feature importance).

Current problems:

- Inefficient implementation: We make a rescaled copy of the data.
- Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.)
- Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.
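To make the "surprising API" point concrete, here is a minimal sketch (not MLlib code) using one-dimensional ridge regression, whose minimizer has a closed form. The data, the penalty `lam`, and the scale factor `s` are made-up values for illustration. It shows that training on scaled data with an unchanged penalty gives a different answer after mapping back to the original space, while dividing the penalty by `s * s` recovers the original solution.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of 0.5 * sum((w * x - y)^2) + 0.5 * lam * w^2 over w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Hypothetical toy data and settings, chosen only for illustration.
xs = [10.0, 20.0, 30.0]
ys = [1.0, 2.1, 2.9]
lam = 5.0
s = 10.0  # scale factor applied to the single feature

# Solution on the original data.
w_plain = ridge_1d(xs, ys, lam)

# Train on x / s with the same penalty, then map the weight back (w = w' / s).
scaled = [x / s for x in xs]
w_naive = ridge_1d(scaled, ys, lam) / s

# Train on x / s with the penalty adjusted to lam / s^2, then map back.
w_adjusted = ridge_1d(scaled, ys, lam / (s * s)) / s

print(w_plain, w_naive, w_adjusted)
```

With no regularization (`lam = 0`) all three values coincide, so the discrepancy comes entirely from how the penalty interacts with the scaling.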

This is a proposal, discussed with mengxr, for an "ideal" solution. It will require some breaking API changes, but I'd argue these are necessary in the long term, since this is the best API we have come up with.

Proposal:

Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here.
API:
- Hide featureScaling from API. (breaking change)
- Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
- Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.
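As a rough illustration of the last point, scaling as an explicit preprocessing step might look like the sketch below. This is plain Python, not an MLlib API; `fit_scaler` and `transform` are hypothetical helper names, and scaling to unit sample standard deviation without centering (with zero-variance features mapped to 0.0) is an assumption made for the example.

```python
import math

def fit_scaler(rows):
    """Return the per-feature sample standard deviation of the training rows."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]

def transform(rows, stds):
    """Divide each feature by its standard deviation; constant features become 0.0."""
    return [
        [x / s if s > 0.0 else 0.0 for x, s in zip(row, stds)]
        for row in rows
    ]

# Hypothetical training data: the second feature is constant.
train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
stds = fit_scaler(train)
train_scaled = transform(train, stds)
```

The factors are fitted once on the training data and then reused for any later data, so the model always sees features on the same scale.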

Details on implementation:

GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data.
I haven't thought this through for LBFGS, but I hope dbtsai can weigh in here.
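The GradientDescent idea above can be sketched as follows. This is an illustrative toy in plain Python, not the MLlib implementation: it uses unregularized least squares, a fixed step size, and made-up data, and all function names are hypothetical. If feature j is divided by s_j, a gradient step of size `step` in the scaled space corresponds to a step of size `step / s_j^2` on the original weight, so only a vector of length numFeatures has to be stored.

```python
import math

def column_std(rows):
    """Per-feature sample standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]

def gradient(rows, labels, w):
    """Gradient of the mean of 0.5 * (w . x - y)^2 over the data."""
    n = len(rows)
    g = [0.0] * len(w)
    for x, y in zip(rows, labels):
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / n
    return g

def gd_on_scaled_copy(rows, labels, std, step, iters):
    """Baseline: materialize a rescaled copy, run plain GD, map weights back."""
    scaled = [[xj / sj for xj, sj in zip(x, std)] for x in rows]  # full data copy
    w_scaled = [0.0] * len(std)
    for _ in range(iters):
        g = gradient(scaled, labels, w_scaled)
        w_scaled = [wj - step * gj for wj, gj in zip(w_scaled, g)]
    return [wj / sj for wj, sj in zip(w_scaled, std)]

def gd_per_feature_step(rows, labels, std, step, iters):
    """Same iterates, but on the original data with one step size per feature."""
    steps = [step / (sj * sj) for sj in std]  # vector of length numFeatures
    w = [0.0] * len(std)
    for _ in range(iters):
        g = gradient(rows, labels, w)
        w = [wj - tj * gj for wj, tj, gj in zip(w, steps, g)]
    return w

# Hypothetical data with features on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0]]
labels = [2.0 * a + 0.01 * b for a, b in rows]
std = column_std(rows)
w_copy = gd_on_scaled_copy(rows, labels, std, step=0.1, iters=200)
w_steps = gd_per_feature_step(rows, labels, std, step=0.1, iters=200)
```

Both routines produce the same weights up to floating-point rounding. With a regularization term, the penalty would also need adjusting, which this sketch deliberately leaves out.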

Issue Links

Is contained by: SPARK-8522 Disable feature scaling in Linear and Logistic Regression (Resolved)
Is duplicated by: SPARK-6348 Enable useFeatureScaling in SVMWithSGD (Resolved)
Relates to: SPARK-7780 The intercept in LogisticRegressionWithLBFGS should not be regularized (Resolved)

People

Assignee: DB Tsai
Reporter: Joseph K. Bradley
Votes: 0
Watchers: 4

Dates

Created: 02/Apr/15 18:38
Updated: 03/Aug/15 00:30
Resolved: 03/Aug/15 00:30