Spark / SPARK-6683

Handling feature scaling properly for GLMs

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • 1.3.0
    • 1.5.0
    • Component/s: MLlib
    • None

    Description

      GeneralizedLinearAlgorithm can scale features. This has two effects:

      • improves optimization behavior (essentially always improves behavior in practice)
      • changes the optimal solution (often for the better in terms of standardizing feature importance)
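
The second effect can be illustrated with a minimal sketch, assuming a single feature and an L2-regularized least-squares objective (the issue does not name a specific loss). The function names and the scale factor `sigma` are hypothetical and chosen only for illustration. Without regularization, fitting on the scaled feature and mapping the weight back recovers the same solution; with regularization, the two solutions differ.

```python
def ridge_weight_1d(xs, ys, lam):
    """Closed-form minimizer of (1/n) * sum((y - w*x)^2) + lam * w^2 for one feature."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


def ridge_weight_via_scaling(xs, ys, lam, sigma):
    """Fit on x / sigma, then map the weight back to the original feature units."""
    scaled = [x / sigma for x in xs]
    return ridge_weight_1d(scaled, ys, lam) / sigma


# Toy data; sigma is an illustrative scale factor, not a computed statistic.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.0, 2.1, 2.9, 4.2]
sigma = 10.0

# lam = 0: scaling is only a change of units, so both routes agree.
same_without_reg = abs(
    ridge_weight_1d(xs, ys, 0.0) - ridge_weight_via_scaling(xs, ys, 0.0, sigma)
)

# lam > 0: the penalty acts on the scaled weight, so the mapped-back solution moves.
gap_with_reg = abs(
    ridge_weight_1d(xs, ys, 0.5) - ridge_weight_via_scaling(xs, ys, 0.5, sigma)
)
```

In this setup the scaled fit is equivalent to penalizing the original weight by `lam * sigma**2`, which is why the regularized optimum shifts while the unregularized one does not.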

      Current problems:

      • Inefficient implementation: We make a rescaled copy of the data.
      • Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.)
      • Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.

      This is a proposal, discussed with mengxr, for an "ideal" solution. It will require some breaking API changes, but I'd argue they are necessary for the long term, since this is the best API we have thought of.

      Proposal:

      • Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here.
      • API:
        • Hide featureScaling from API. (breaking change)
        • Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
        • Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.
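
The external preprocessing step in the last bullet might look like the following sketch. It is a plain-Python illustration, not MLlib code: `column_std` and `scale_to_unit_std` are hypothetical helpers, and the choice of population standard deviation without centering is an assumption. In practice a library transformer such as MLlib's `StandardScaler` would play this role.

```python
import math


def column_std(rows):
    """Population standard deviation of each feature column."""
    n = len(rows)
    num_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(num_features)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(num_features)
    ]


def scale_to_unit_std(rows):
    """Divide each column by its std; constant columns (std 0) are left unchanged."""
    stds = column_std(rows)
    return [[x / s if s > 0 else x for x, s in zip(r, stds)] for r in rows]


# Toy feature matrix with columns on very different scales.
features = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]

# The user rescales explicitly, then hands scaled_features to the learner.
scaled_features = scale_to_unit_std(features)
```

Doing the scaling here makes the change to the optimal solution an explicit user decision rather than a hidden default of the algorithm.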

      Details on implementation:

      • GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data.
      • I haven't thought this through for LBFGS, but I hope dbtsai can weigh in here.
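
The GradientDescent idea above can be sketched as follows, assuming an unregularized least-squares objective and plain batch gradient descent (both are assumptions; the regularization adjustment mentioned in the issue is not covered here). All function names and the per-feature scales `sigma` are hypothetical. The sketch compares descent on a materialized scaled copy with descent on the original data using step size `step / sigma_j**2` for feature `j`, which needs only one extra vector of length numFeatures.

```python
def gradient(rows, ys, w):
    """Gradient of the mean squared error (1/(2n)) * sum((w.x - y)^2)."""
    n = len(rows)
    g = [0.0] * len(w)
    for x, y in zip(rows, ys):
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / n
    return g


def gd_on_scaled_copy(rows, ys, sigma, step, iters):
    """Baseline: materialize x / sigma, run plain gradient descent, map weights back."""
    scaled = [[xj / sj for xj, sj in zip(x, sigma)] for x in rows]
    w = [0.0] * len(sigma)
    for _ in range(iters):
        g = gradient(scaled, ys, w)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return [wj / sj for wj, sj in zip(w, sigma)]


def gd_per_feature_step(rows, ys, sigma, step, iters):
    """Same iterates without the copy: feature j uses step size step / sigma_j^2."""
    steps = [step / (sj * sj) for sj in sigma]  # one vector of length numFeatures
    w = [0.0] * len(sigma)
    for _ in range(iters):
        g = gradient(rows, ys, w)
        w = [wj - sz * gj for wj, sz, gj in zip(w, steps, g)]
    return w


# Toy data; sigma holds illustrative per-feature scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0]]
ys = [1.5, 3.5, 3.0, 6.0]
sigma = [1.0, 100.0]

w_copy = gd_on_scaled_copy(rows, ys, sigma, step=0.05, iters=200)
w_direct = gd_per_feature_step(rows, ys, sigma, step=0.05, iters=200)
```

The two routes agree up to floating-point rounding because a weight `w'` on the scaled feature corresponds to `w = w' / sigma` on the original one, so each scaled-space update translates into an original-space update with the gradient multiplied by `1 / sigma**2`.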

            People

              Assignee: dbtsai DB Tsai
              Reporter: josephkb Joseph K. Bradley
              Votes: 0
              Watchers: 4
