Spark / SPARK-1859

Linear, Ridge and Lasso Regressions with SGD yield unexpected results

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.9.1
    • Fix Version/s: None
    • Component/s: MLlib
    • Environment: OS: Ubuntu Server 12.04 x64
      PySpark

    Description

      Issue:
      Linear Regression with SGD does not work as expected on any data except lpsa.dat (the dataset used in the example).
      Ridge Regression with SGD sometimes works OK.
      Lasso Regression with SGD sometimes works OK.

      Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :

      regression_example.py
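      from numpy import array
      from pyspark.mllib.regression import LinearRegressionWithSGD

      # `sc` is the SparkContext provided by the PySpark shell.
      # Each record is array([label, feature]): the label comes first.
      parsedData = sc.parallelize([
          array([2400., 1500.]),
          array([240., 150.]),
          array([24., 15.]),
          array([2.4, 1.5]),
          array([0.24, 0.15])
      ])

      # Build the model with the default SGD settings
      model = LinearRegressionWithSGD.train(parsedData)
      # Print the fitted coefficient(s)
      print(model._coeffs)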
      

      The records lie exactly on the line f(X) = 1.6 * X. (Fortunately, the simpler case f(X) = X works!)
      The resulting model has nan coefficients: array([ nan]).
      Furthermore, if you comment out the records one at a time, starting from the first, you get:

      • [-1.55897475e+296] coeff (the first record is commented out),
      • [-8.62115396e+104] coeff (the first two records are commented out),
      • etc.

      It looks like the implemented regression algorithms diverge on this data.
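      The divergence can be illustrated without Spark. The following is a minimal NumPy sketch, not MLlib's implementation: it runs plain batch gradient descent on the same records, and the helper name fit_slope, the zero starting weight, the step size of 1.0, the 1/sqrt(t) decay and the 100 iterations are all assumptions made for this illustration. Because the mean squared feature value is in the hundreds of thousands, each update overshoots the minimum by a large factor, so the weight grows until it overflows.

```python
import numpy as np

def fit_slope(records, step_size=1.0, iterations=100):
    """Batch gradient descent for label ~ w * feature (squared loss, no intercept)."""
    y, x = records[:, 0], records[:, 1]  # label first, feature second
    w = 0.0  # starting weight (an assumption of this sketch)
    for t in range(1, iterations + 1):
        # Mean gradient of 0.5 * (w * x - y) ** 2 over all records.
        grad = np.mean((w * x - y) * x)
        # Decaying step size (assumed schedule, not taken from the report).
        w -= step_size / np.sqrt(t) * grad
    return w

records = np.array([
    [2400., 1500.],
    [240., 150.],
    [24., 15.],
    [2.4, 1.5],
    [0.24, 0.15],
])

# Overflow is expected here, so silence NumPy's floating-point warnings.
with np.errstate(over="ignore", invalid="ignore"):
    # Drop 0, 1 and 2 leading records, as in the experiment described above.
    slopes = [fit_slope(records[dropped:]) for dropped in range(3)]

for dropped, slope in enumerate(slopes):
    print(dropped, "leading record(s) dropped ->", slope)
# With all five records the weight overflows and ends up as nan;
# with fewer large records it stays finite but is astronomically large.
```

      Dropping the largest records lowers the mean squared feature value, which slows the blow-up but does not stop it.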

      I get almost the same results with Ridge and Lasso.

      I've also tested these inputs in scikit-learn and it works as expected there.

      However, I'm still not sure whether this is a bug or an SGD 'feature'. Should I preprocess my datasets somehow?
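      The report does not record what preprocessing, if any, was recommended. As a hedged illustration of the two usual remedies for gradient descent on unscaled data, the NumPy sketch below either rescales the feature or shrinks the step size. The helper descend, the max-abs scaling and the choice of step size are assumptions for this example and are not MLlib code. For a single feature with a fixed step, the iteration is stable only when the step is below 2 / mean(x**2).

```python
import numpy as np

def descend(x, y, step_size=1.0, iterations=100):
    """Batch gradient descent for y ~ w * x with a step of step_size / sqrt(t)."""
    w = 0.0
    for t in range(1, iterations + 1):
        grad = np.mean((w * x - y) * x)  # mean gradient of 0.5 * (w * x - y) ** 2
        w -= step_size / np.sqrt(t) * grad
    return w

# Same records as in the report: labels y, single feature x.
y = np.array([2400., 240., 24., 2.4, 0.24])
x = np.array([1500., 150., 15., 1.5, 0.15])

# Remedy 1: rescale the feature so its largest magnitude is 1,
# fit on the scaled feature, then map the weight back to the original units.
scale = np.abs(x).max()
w_scaled = descend(x / scale, y) / scale

# Remedy 2: keep the raw feature but use a step size of 1 / mean(x ** 2),
# which is below the stability limit of 2 / mean(x ** 2).
safe_step = 1.0 / np.mean(x ** 2)
w_small_step = descend(x, y, step_size=safe_step)

print("scaled feature:", w_scaled)
print("small step:    ", w_small_step)
```

      With the scaled feature the weight moves toward 1.6 instead of diverging, although 100 iterations with a decaying step do not reach it exactly; more iterations or a tuned step size would close the gap.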

            People

              Assignee: Unassigned
              Reporter: Vlad Frolov (frol)