Spark / SPARK-6683

Handling feature scaling properly for GLMs

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.5.0
    • Component/s: MLlib
    • Labels: None


      GeneralizedLinearAlgorithm can scale features. This has two effects:

      • It improves optimization behavior (in practice, essentially always).
      • It changes the optimal solution (often for the better, since it standardizes feature importance).

      Current problems:

      • Inefficient implementation: We make a rescaled copy of the data.
      • Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.)
      • Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.
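
To make the note about handling scaling without changing the solution concrete, here is a minimal sketch using one-feature ridge regression with no intercept, where the optimum has a closed form. It is an illustration only, not MLlib code: the helper `ridge_1d`, the toy data, and the choice of the sample standard deviation as the scale are all assumptions made for this example.

```python
import statistics

def ridge_1d(xs, ys, lam):
    """Closed-form minimizer of (1/2n) * sum((y - w*x)^2) + (lam/2) * w^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

# Toy data, made up for illustration.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.1, 1.9, 3.2, 3.9]
lam = 1.0
scale = statistics.stdev(xs)

# Solution on the original data (what a user comparing against R would expect).
w_plain = ridge_1d(xs, ys, lam)

# Naive scaling: fit on x / scale with the same penalty, then map the weight back.
scaled_xs = [x / scale for x in xs]
w_naive = ridge_1d(scaled_xs, ys, lam) / scale

# Adjusted scaling: the scaled-space weight is scale * w, so penalizing it with
# lam / scale^2 is the same as penalizing the original weight with lam.
w_adjusted = ridge_1d(scaled_xs, ys, lam / scale ** 2) / scale

print(w_plain, w_naive, w_adjusted)
```

With a nonzero penalty, `w_naive` differs from `w_plain`, while `w_adjusted` matches it up to floating-point rounding. With `lam = 0` all three coincide, which is why the discrepancy only appears for regularized models.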

      This is a proposal, discussed with Xiangrui Meng, for an "ideal" solution. It will require some breaking API changes, but I'd argue they are necessary for the long term, since this is the best API we have thought of.


      • Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here.
      • API:
        • Hide featureScaling from the API. (breaking change)
        • Internally, handle feature scaling to improve optimization, but modify it so that it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
        • Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.
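
As a rough sketch of what the external preprocessing step amounts to, the following standardizes each feature column to unit standard deviation before the data is handed to a learner. It is plain Python rather than Spark code; in MLlib a user would more likely reach for a transformer such as StandardScaler. The helper names and the handling of constant columns (mapped to 0.0) are assumptions of this example.

```python
import statistics

def column_stds(rows):
    """Sample standard deviation of each feature column."""
    return [statistics.stdev(col) for col in zip(*rows)]

def scale_to_unit_std(rows, stds):
    """Divide each feature by its column's standard deviation.

    Constant columns (std == 0) are mapped to 0.0; this is a choice made
    for the sketch, not a statement about any library's behavior.
    """
    return [[x / s if s > 0 else 0.0 for x, s in zip(row, stds)]
            for row in rows]

# Toy feature matrix with columns on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0]]
stds = column_stds(rows)
scaled_rows = scale_to_unit_std(rows, stds)

# The learner is then trained on scaled_rows; the fitted weights refer to
# the scaled features, which is the solution change the user asked for.
print(column_stds(scaled_rows))
```

Doing this outside the algorithm keeps the change to the solution explicit and under the user's control.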

      Details on implementation:

      • GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data.
      • I haven't thought this through for LBFGS, but I hope DB Tsai can weigh in here.
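
The per-feature step size idea can be checked on a small case. The sketch below is not the GradientDescent implementation; it is an unregularized least-squares toy in plain Python, with made-up data and helper names. It relies on the identity that if the scaled-space weight is w'_j = s_j * w_j, then a step of size eta on the scaled data equals a step of size eta / s_j^2 on the original data, so only a vector of length numFeatures needs to be stored. The regularization adjustment mentioned above is left out.

```python
import statistics

def gradient(rows, ys, w):
    """Gradient of (1/2n) * sum((w . x - y)^2) with respect to w."""
    n = len(rows)
    g = [0.0] * len(w)
    for row, y in zip(rows, ys):
        err = sum(wj * xj for wj, xj in zip(w, row)) - y
        for j, xj in enumerate(row):
            g[j] += err * xj / n
    return g

def gd_on_scaled_copy(rows, ys, stds, step, iters):
    """Current approach: materialize a rescaled copy, then map weights back."""
    scaled = [[x / s for x, s in zip(row, stds)] for row in rows]
    w = [0.0] * len(stds)
    for _ in range(iters):
        g = gradient(scaled, ys, w)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return [wj / s for wj, s in zip(w, stds)]

def gd_per_feature_step(rows, ys, stds, step, iters):
    """Proposed approach: original data, one step size per feature."""
    steps = [step / (s * s) for s in stds]  # vector of length numFeatures
    w = [0.0] * len(stds)
    for _ in range(iters):
        g = gradient(rows, ys, w)
        w = [wj - sj * gj for wj, sj, gj in zip(w, steps, g)]
    return w

rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0]]
ys = [1.5, 3.5, 4.0, 6.5]
stds = [statistics.pstdev(col) for col in zip(*rows)]

w_copy = gd_on_scaled_copy(rows, ys, stds, 0.1, 50)
w_step = gd_per_feature_step(rows, ys, stds, 0.1, 50)
print(w_copy, w_step)
```

Both runs produce the same iterates up to floating-point rounding, so the optimization benefit of scaling is kept without copying the data.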


            Assignee: DB Tsai (dbtsai)
            Reporter: Joseph K. Bradley (josephkb)