Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version: 0.9.1
- Fix Version: None
- Environment: OS: Ubuntu Server 12.04 x64
- Component: PySpark
Description
Issue:
Linear Regression with SGD does not work as expected on any data except lpsa.dat (the bundled example).
Ridge Regression with SGD sometimes works.
Lasso Regression with SGD sometimes works.
Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

# Each record is [label, feature]
parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15]),
])

# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
So we have a line (f(X) = 1.6 * X) here.
The resulting model has NaN coeffs: array([ nan]).
Curiously, f(X) = X works fine.
Furthermore, if you comment out the records one by one, you get:
- a coeff of [-1.55897475e+296] (with the first record commented out),
- a coeff of [-8.62115396e+104] (with the first two records commented out),
- etc.
It looks like the implemented regression algorithms diverge somehow.
I get almost the same results with Ridge and Lasso.
I've also tested these inputs in scikit-learn, and it works as expected there.
However, I'm still not sure whether this is a bug or an SGD 'feature'. Should I preprocess my datasets somehow?
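For what it's worth, the divergence can be reproduced without MLlib at all. The following is a minimal plain-NumPy sketch (my own reproduction, not the MLlib implementation) of least-squares SGD on the same five points, assuming a fixed step size of 1.0 (which I believe is the MLlib default). On the raw feature the per-record gradient is on the order of 1e6, so the weight overshoots and blows up to NaN; after scaling the feature into [0, 1] the same loop converges:

```python
import numpy as np

# Same data as in the report: y = 1.6 * x
x = np.array([1500., 150., 15., 1.5, 0.15])
y = 1.6 * x

def sgd(x, y, step=1.0, iters=100):
    """Single-feature least-squares SGD, no intercept."""
    w = 0.0
    for _ in range(iters):
        for xi, yi in zip(x, y):
            grad = (w * xi - yi) * xi  # gradient of 0.5 * (w*x - y)^2
            w -= step * grad
    return w

w_raw = sgd(x, y)               # first update alone jumps to ~3.6e6, then blows up
w_scaled = sgd(x / x.max(), y)  # feature scaled into [1e-4, 1]: converges
```

On this data `w_raw` ends up NaN, while `w_scaled / x.max()` recovers the true slope 1.6, which suggests the answer to my question is: yes, scale the features (or shrink the step size) before using the SGD-based trainers.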
Attachments
Issue Links
- is related to SPARK-1585 "Not robust Lasso causes Infinity on weights and losses" (Closed)