GeneralizedLinearAlgorithm can scale features. This has 2 effects:
- improves optimization behavior (essentially always, in practice)
- changes the optimal solution (often for the better in terms of standardizing feature importance)
Current problems:
- Inefficient implementation: We make a rescaled copy of the data.
- Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.)
- Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.
This is a proposal discussed with Xiangrui Meng for an "ideal" solution. It will require some breaking API changes, but I'd argue these are necessary for the long term, since it's the best API we have thought of.
- Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here.
- Hide featureScaling from the API. (breaking change)
- Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
- Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.
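
To illustrate the point about internal scaling not changing the optimal solution, here is a minimal sketch in plain Python (not MLlib code; `gradient_descent` and `fit_with_internal_scaling` are hypothetical names). It assumes unregularized least squares with no intercept: the optimizer runs on features divided by their standard deviation, and the weights are then mapped back to the original feature space. With regularization, the penalty would also need adjusting, which this sketch does not cover.

```python
import statistics

def gradient_descent(rows, labels, step, iters):
    """Batch gradient descent on half the mean squared error (no intercept, no regularization)."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for x, y in zip(rows, labels):
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            for j in range(d):
                grad[j] += err * x[j] / n
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w

def fit_with_internal_scaling(rows, labels, step, iters):
    """Optimize in the scaled space, then return weights for the ORIGINAL features."""
    d = len(rows[0])
    sigma = [statistics.pstdev([r[j] for r in rows]) for j in range(d)]
    # A full scaled copy is made here only to keep the sketch short.
    scaled = [[xj / sj for xj, sj in zip(x, sigma)] for x in rows]
    w_scaled = gradient_descent(scaled, labels, step, iters)
    # A prediction w_scaled . (x / sigma) equals (w_scaled / sigma) . x,
    # so dividing by sigma undoes the scaling without changing the model.
    return [wj / sj for wj, sj in zip(w_scaled, sigma)]

# Two features on very different scales; labels are generated exactly by true_w.
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 5000.0]]
true_w = [2.0, 0.003]
labels = [sum(wj * xj for wj, xj in zip(true_w, x)) for x in rows]

w_internal = fit_with_internal_scaling(rows, labels, step=0.1, iters=2000)
```

In this toy problem the mapped-back weights agree with `true_w`, the unique minimizer on the unscaled data, so the scaling only affects how the optimizer gets there.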
Details on implementation:
- GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data.
- I haven't thought this through for LBFGS, but I hope DB Tsai can weigh in here.
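
The GradientDescent idea above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the MLlib implementation: unregularized least squares, no intercept, and hypothetical function names. If the scaled weights are w' = sigma * w, then a gradient step of size `step` in the scaled space corresponds to a step of size `step / sigma_j**2` on feature j in the original space, so only a vector of length numFeatures needs to be stored.

```python
import statistics

def gradient(rows, labels, w):
    """Gradient of half the mean squared error with respect to w."""
    n, d = len(rows), len(w)
    grad = [0.0] * d
    for x, y in zip(rows, labels):
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j in range(d):
            grad[j] += err * x[j] / n
    return grad

def gd_on_scaled_copy(rows, labels, sigma, step, iters):
    """Current approach: copy and rescale the data, then map weights back."""
    scaled = [[xj / sj for xj, sj in zip(x, sigma)] for x in rows]
    w_s = [0.0] * len(sigma)
    for _ in range(iters):
        g = gradient(scaled, labels, w_s)
        w_s = [wj - step * gj for wj, gj in zip(w_s, g)]
    return [wj / sj for wj, sj in zip(w_s, sigma)]

def gd_per_feature_step(rows, labels, sigma, step, iters):
    """Sketched alternative: original data, one step size per feature."""
    steps = [step / (sj * sj) for sj in sigma]  # vector of length numFeatures
    w = [0.0] * len(sigma)
    for _ in range(iters):
        g = gradient(rows, labels, w)
        w = [wj - tj * gj for wj, tj, gj in zip(w, steps, g)]
    return w

rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 5000.0]]
labels = [5.0, 13.0, 12.0, 23.0]
sigma = [statistics.pstdev([r[j] for r in rows]) for j in range(2)]

w_copy = gd_on_scaled_copy(rows, labels, sigma, step=0.1, iters=25)
w_vec = gd_per_feature_step(rows, labels, sigma, step=0.1, iters=25)
```

Both routines start from zero weights and produce the same iterates up to floating-point rounding, so `w_copy` and `w_vec` should agree closely even before convergence.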