Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version: 0.9.1
- Fix Version: None
- Environment: OS: Ubuntu Server 12.04 x64
- Component: PySpark
Description
Issue:
Linear Regression with SGD does not work as expected on any data except lpsa.dat (the bundled example).
Ridge Regression with SGD sometimes works.
Lasso Regression with SGD sometimes works.
Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

# Each record is [label, feature]
parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15]),
])

# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
So we have a line (f(X) = 1.6 * X) here.
The resulting model has NaN coeffs: array([ nan]).
Curiously, f(X) = X works fine.
Furthermore, if you comment out the records one by one, you get:
- a coeff of [-1.55897475e+296] (with the first record commented out),
- a coeff of [-8.62115396e+104] (with the first two records commented out),
- etc.
It looks like the implemented regression algorithms diverge somehow.
I get almost the same results with Ridge and Lasso.
I've also tested these inputs in scikit-learn, and it works as expected there.
However, I'm still not sure whether this is a bug or an SGD 'feature'. Should I preprocess my datasets somehow?
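For what it's worth, the divergence can be reproduced without MLlib at all. The following is a minimal plain-NumPy sketch (my own reproduction, not the MLlib implementation) of least-squares SGD on the same five points, assuming a fixed step size of 1.0 (which I believe is the MLlib default). On the raw feature the per-record gradient is on the order of 1e6, so the weight overshoots and blows up to NaN; after scaling the feature into [0, 1] the same loop converges:

```python
import numpy as np

# Same data as in the report: y = 1.6 * x
x = np.array([1500., 150., 15., 1.5, 0.15])
y = 1.6 * x

def sgd(x, y, step=1.0, iters=100):
    """Single-feature least-squares SGD, no intercept."""
    w = 0.0
    for _ in range(iters):
        for xi, yi in zip(x, y):
            grad = (w * xi - yi) * xi  # gradient of 0.5 * (w*x - y)^2
            w -= step * grad
    return w

w_raw = sgd(x, y)               # first update alone jumps to ~3.6e6, then blows up
w_scaled = sgd(x / x.max(), y)  # feature scaled into [1e-4, 1]: converges
```

On this data `w_raw` ends up NaN, while `w_scaled / x.max()` recovers the true slope 1.6, which suggests the answer to my question is: yes, scale the features (or shrink the step size) before using the SGD-based trainers.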
Attachments
Issue Links
- is related to SPARK-1585 "Not robust Lasso causes Infinity on weights and losses" (Closed)