Spark / SPARK-6683

Handling feature scaling properly for GLMs

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • 1.3.0
    • 1.5.0
    • MLlib
    • None

    Description

      GeneralizedLinearAlgorithm can scale features. This has two effects:

      • improves optimization behavior (in practice, essentially always)
      • changes the optimal solution (often for the better in terms of standardizing feature importance)
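      The first effect can be illustrated with a minimal sketch in plain Python. It is not MLlib code: the toy data, the helper names (column_std, gradient, iterations_to_converge), the conservative step-size rule, and scaling by the standard deviation without centering are all assumptions made for this illustration. It counts how many plain gradient-descent iterations an unregularized least-squares fit needs on raw features versus features divided by their standard deviation.

```python
import math

def column_std(rows):
    """Population standard deviation of each feature column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]

def gradient(w, rows, ys):
    """Gradient of 0.5 * mean((w.x - y)^2) with respect to w."""
    n = len(rows)
    g = [0.0] * len(w)
    for x, y in zip(rows, ys):
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / n
    return g

def iterations_to_converge(rows, ys, tol=1e-6, max_iters=100000):
    """Run plain gradient descent; return the iteration at which max|grad| < tol."""
    n, d = len(rows), len(rows[0])
    # Conservative step size (an assumption of this sketch):
    # 1 / trace(X^T X / n), which bounds the largest Hessian eigenvalue from above.
    step = n / sum(xj * xj for x in rows for xj in x)
    w = [0.0] * d
    for it in range(1, max_iters + 1):
        g = gradient(w, rows, ys)
        if max(abs(gj) for gj in g) < tol:
            return it
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return max_iters

# Toy data: two features on very different scales.
rows = [[1.0, 30.0], [-1.0, 30.0], [1.0, -30.0], [-1.0, -30.0]]
ys = [4.0, 2.0, -2.0, -4.0]  # generated by y = 1.0 * x1 + 0.1 * x2

sigma = column_std(rows)
scaled_rows = [[xj / sj for xj, sj in zip(x, sigma)] for x in rows]

raw_iters = iterations_to_converge(rows, ys)
scaled_iters = iterations_to_converge(scaled_rows, ys)
print(raw_iters, scaled_iters)
```

      On this contrived data the scaled problem is perfectly conditioned, so the gap is larger than one would expect on real data; the direction of the effect is what the sketch is meant to show.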

      Current problems:

      • Inefficient implementation: we make a rescaled copy of the data.
      • Surprising API: for algorithms that use feature scaling, users may get solutions that differ from those of R or other libraries. (Note: feature scaling could be handled without changing the solution.)
      • Inconsistent API: not all algorithms have the same default for feature scaling, and not all expose the option.
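      The note on the "Surprising API" item can be checked on a one-feature ridge regression, which has a closed-form solution. The sketch below is an illustration only, not MLlib code: the data, the scale factor s, and the helper ridge_1d are hypothetical. It shows that fitting on a scaled feature with an unchanged L2 penalty gives a different coefficient after mapping back, while dividing the penalty by s^2 recovers the unscaled solution.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((w*x - y)^2) + lam * w^2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.1, 1.9, 3.2, 3.8]
lam = 5.0   # L2 penalty strength (hypothetical value)
s = 10.0    # scale factor applied to the feature (hypothetical value)

# Reference: fit on the raw feature.
w_raw = ridge_1d(xs, ys, lam)

# Fit on x / s with the same penalty, then map the weight back (w = w' / s).
scaled = [x / s for x in xs]
w_naive = ridge_1d(scaled, ys, lam) / s

# Fit on x / s with the penalty divided by s^2, then map back.
# This penalizes (w' / s)^2, i.e. the weight in the original feature space.
w_adjusted = ridge_1d(scaled, ys, lam / (s * s)) / s

print(w_raw, w_naive, w_adjusted)
```

      With several features the same idea applies per feature: the penalty on each scaled weight is divided by that feature's squared scale factor.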

      This proposal for an "ideal" solution was discussed with Xiangrui Meng. It requires some breaking API changes, but I'd argue they are necessary in the long term, since this is the best API we have thought of.

      Proposal:

      • Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here.
      • API:
        • Hide featureScaling from API. (breaking change)
        • Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
        • Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.

      Details on implementation:

      • GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data.
      • I haven't thought this through for LBFGS, but I hope DB Tsai can weigh in here.
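      The GradientDescent idea above can be sketched in plain Python for the unregularized case. This is an illustration under stated assumptions, not the MLlib implementation: the helper names and toy data are hypothetical, features are scaled by their standard deviation without centering, and regularization is left out. Gradient descent on a scaled copy with step size eta produces, after mapping the weights back, the same iterates as gradient descent on the original data with per-feature step size eta / sigma_j^2, so only a vector of length numFeatures is needed.

```python
import math

def column_std(rows):
    """Population standard deviation of each feature column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]

def gradient(w, rows, ys):
    """Gradient of 0.5 * mean((w.x - y)^2) with respect to w."""
    n = len(rows)
    g = [0.0] * len(w)
    for x, y in zip(rows, ys):
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / n
    return g

def gd_on_scaled_copy(rows, ys, sigma, step, iters):
    """Current approach: materialize a rescaled copy of the data."""
    scaled = [[xj / sj for xj, sj in zip(x, sigma)] for x in rows]
    w = [0.0] * len(sigma)
    for _ in range(iters):
        g = gradient(w, scaled, ys)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    # Map weights back to the original feature space.
    return [wj / sj for wj, sj in zip(w, sigma)]

def gd_per_feature_step(rows, ys, sigma, step, iters):
    """Proposed approach: no data copy, one step size per feature."""
    steps = [step / (sj * sj) for sj in sigma]  # vector of length numFeatures
    w = [0.0] * len(sigma)
    for _ in range(iters):
        g = gradient(w, rows, ys)
        w = [wj - sj * gj for wj, sj, gj in zip(w, steps, g)]
    return w

rows = [[1.0, 100.0], [2.0, 250.0], [3.0, 310.0], [4.0, 480.0]]
ys = [1.5, 3.1, 4.0, 6.2]
sigma = column_std(rows)

w_copy = gd_on_scaled_copy(rows, ys, sigma, step=0.1, iters=200)
w_step = gd_per_feature_step(rows, ys, sigma, step=0.1, iters=200)
print(w_copy, w_step)
```

      The two weight vectors agree up to floating-point rounding. With an L2 penalty the per-feature update would also need the regularization adjustment mentioned above, which this sketch does not attempt.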

          People

            Assignee: DB Tsai (dbtsai)
            Reporter: Joseph K. Bradley (josephkb)
            Votes: 0
            Watchers: 4
