Spark / SPARK-1859

Linear, Ridge and Lasso Regressions with SGD yield unexpected results

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.9.1
    • Fix Version/s: None
    • Component/s: MLlib
    • Environment: Ubuntu Server 12.04 x64, PySpark

    Description

      Issue:
      Linear Regression with SGD doesn't work as expected on any data except lpsa.dat (the example dataset).
      Ridge Regression with SGD sometimes works OK.
      Lasso Regression with SGD sometimes works OK.

      Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :

      regression_example.py
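      from numpy import array
      from pyspark.mllib.regression import LinearRegressionWithSGD

      # `sc` is the SparkContext provided by the PySpark shell.
      # Each record is [label, feature]; here label = 1.6 * feature.
      parsedData = sc.parallelize([
          array([2400., 1500.]),
          array([240., 150.]),
          array([24., 15.]),
          array([2.4, 1.5]),
          array([0.24, 0.15])
      ])
      
      # Build the model with the default training parameters
      model = LinearRegressionWithSGD.train(parsedData)
      print model._coeffs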
      

      So the data lie exactly on the line f(X) = 1.6 * X. Fortunately, data following f(X) = X works!
      The resulting model has nan coefficients: array([ nan]).
      Furthermore, if you comment out the records one by one, you get:

      • [-1.55897475e+296] coeff (the first record is commented out),
      • [-8.62115396e+104] coeff (the first two records are commented out),
      • etc.

      It looks like the implemented regression algorithms diverge somehow.

      I get almost the same results with Ridge and Lasso.

      I've also tested these inputs in scikit-learn, and they work as expected there.

      However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow?
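One possible explanation, offered as an assumption rather than a confirmed diagnosis: with feature values as large as 1500, a step size of 1.0 makes every gradient update overshoot the minimum, so the coefficient grows until it overflows. The sketch below reproduces that behaviour in plain Python, without Spark. The helper `gradient_descent` is hypothetical, and its step / sqrt(t) decay schedule, step size of 1.0, and 100 iterations are assumptions chosen for illustration, not a copy of MLlib's implementation.

```python
import math


def gradient_descent(points, step, iterations):
    """Full-batch gradient descent for y ~ w * x (no intercept), squared loss.

    The step size decays as step / sqrt(t). This schedule is an assumption
    made for illustration, not a confirmed copy of MLlib's updater.
    """
    w = 0.0
    n = len(points)
    for t in range(1, iterations + 1):
        # Gradient of the mean of 0.5 * (w * x - y)^2 with respect to w.
        grad = sum((w * x - y) * x for y, x in points) / n
        w -= (step / math.sqrt(t)) * grad
    return w


# Same (label, feature) pairs as in the report: label = 1.6 * feature.
points = [(2400.0, 1500.0), (240.0, 150.0), (24.0, 15.0), (2.4, 1.5), (0.24, 0.15)]

# Unscaled features: every update overshoots, |w| explodes and stops being finite.
w_raw = gradient_descent(points, step=1.0, iterations=100)

# Rescale the feature by its largest absolute value, train, then map the
# coefficient back to the original units.
scale = max(abs(x) for _, x in points)
scaled = [(y, x / scale) for y, x in points]
w_scaled = gradient_descent(scaled, step=1.0, iterations=100) / scale
```

In this sketch `w_raw` ends up non-finite, while `w_scaled` lands near 1.6. Under the same assumptions, lowering the step size instead of rescaling the features should also keep the updates from overshooting.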

          People

            Assignee: Unassigned
            Reporter: Vlad Frolov (frol)
            Votes: 0
            Watchers: 3
