Linear Regression with SGD doesn't work as expected on any data except lpsa.dat (example one).
Ridge Regression with SGD sometimes works ok.
Lasso Regression with SGD sometimes works ok.
Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
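The snippet itself is not preserved here; a reconstruction along the lines of the linked 0.9.0 guide might look like the following (the file name and record values are illustrative, not the exact originals; `sc` is the SparkContext provided by the PySpark shell):

```python
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

# Hypothetical reconstruction -- the exact records are not preserved.
# One feature, labels on the line f(X) = 1.6 * X, i.e. lines like
# "1.6,1.0", "3.2,2.0", ... in label,feature order.
data = sc.textFile("data.csv")  # illustrative path
parsedData = data.map(
    lambda line: array([float(v) for v in line.split(',')]))

model = LinearRegressionWithSGD.train(parsedData)
print(model.weights)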
So the data lie on the line f(X) = 1.6 * X. The resulting model has NaN coefficients: array([ nan]). Curiously, f(X) = X works!
Furthermore, if you comment out the records one by one, you get:
- a coefficient of [-1.55897475e+296] (the first record commented out),
- a coefficient of [-8.62115396e+104] (the first two records commented out).
It looks like the implemented regression algorithm diverges somehow.
I get almost the same results on Ridge and Lasso.
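The blow-up pattern matches plain gradient descent with a step size that is too large for the feature scale. A minimal NumPy sketch (a hypothetical stand-in, not MLlib's actual implementation) reproduces it on the same y = 1.6 * x data:

```python
import numpy as np

# One feature, no intercept, labels on the line y = 1.6 * x.
x = np.arange(10, dtype=float)
y = 1.6 * x

def fit(step, iters):
    """Batch gradient descent on mean squared error."""
    w = 0.0
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(iters):
            grad = 2.0 / len(x) * np.sum((w * x - y) * x)
            w -= step * grad
    return w

print(fit(step=1.0, iters=300))    # nan -- the iterates overflow
print(fit(step=0.01, iters=1000))  # ~1.6 -- converges
```

With step size 1.0 the coefficient's error is multiplied by a constant factor each iteration, so it overflows to NaN within a few hundred iterations; the huge intermediate values are consistent with coefficients like -1.55897475e+296 above. With step 0.01 the same loop converges to 1.6.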
I've also tested these inputs in scikit-learn and it works as expected there.
However, I'm still not sure whether this is a bug or an SGD 'feature'. Should I preprocess my datasets somehow?
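For what it's worth, feature scaling is the usual preprocessing fix for SGD divergence. A hedged sketch (plain NumPy again, not Spark) showing that standardizing the feature and adding an intercept column makes the same data trainable with a fixed step size:

```python
import numpy as np

# Same line as before: y = 1.6 * x.
x = np.arange(10, dtype=float)
y = 1.6 * x

xs = (x - x.mean()) / x.std()                # standardize the feature
A = np.column_stack([np.ones_like(xs), xs])  # intercept + scaled feature

# Batch gradient descent now converges with a fixed step size.
w = np.zeros(2)
for _ in range(500):
    grad = 2.0 / len(y) * A.T @ (A @ w - y)
    w -= 0.1 * grad

print(np.allclose(A @ w, y))  # True -- the fit is recovered exactly
```

After standardization the feature has unit variance, so the gradient's magnitude no longer depends on the raw scale of x and the same kind of step size that diverged before now converges.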