[SPARK-34765] Linear Models standardization optimization - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Resolved
Affects Version/s: 3.1.1, 3.2.0
Fix Version/s: None
Component/s: ML
Labels: None

Description

The existing implementation of standardization in linear models does NOT center the vectors by subtracting the means, in order to preserve the sparsity of the dataset.

However, this causes features with small variance to be scaled to large values, which underlying solvers such as LBFGS cannot handle efficiently. See SPARK-34448 for details.
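A minimal sketch of the effect described above, using only the Python standard library. The feature values are made up for illustration, and the helper names `scale_only` and `center_and_scale` are hypothetical, not Spark APIs.

```python
import statistics

def scale_only(values):
    """Divide by the sample standard deviation without subtracting the mean."""
    std = statistics.stdev(values)
    return [v / std for v in values]

def center_and_scale(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# Illustrative feature: large mean, small variance (values are invented).
feature = [1000.0, 1000.1, 999.9, 1000.2, 999.8]

scaled = scale_only(feature)          # every entry ends up in the thousands
centered = center_and_scale(feature)  # every entry stays close to zero

print(min(scaled), max(scaled))
print(min(centered), max(centered))
```

With scaling alone, the large mean is divided by a small standard deviation, so the solver sees huge, nearly constant inputs; after centering, the same feature stays within a few units of zero.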

If the internal vectors are centered (as in other well-known implementations, e.g. GLMNET and Scikit-Learn), the convergence rate will be better. In the case from SPARK-34448, the number of iterations to convergence is reduced from 93 to 6. Moreover, the final solution is much closer to the one from GLMNET.

Luckily, we found a new way to 'virtually' center the vectors without densifying the dataset, if and only if:

1. fitIntercept is true;
2. there is no penalty on the intercept (this seems to always be true in the existing implementations);
3. there are no bounds on the intercept.
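One way to read 'virtual' centering, sketched here as an assumption rather than as the actual Spark implementation: the centering term can be folded into the intercept, since sum_j coef_j * (x_j - mean_j) / std_j + intercept equals sum_j coef_j * x_j / std_j + (intercept - sum_j coef_j * mean_j / std_j). The offset in parentheses depends only on the coefficients, so it can be computed once per iteration while the per-row work still touches only nonzero entries. All function names below are hypothetical.

```python
import math

def margin_dense_centered(x_dense, coef, intercept, mean, std):
    """Reference: explicitly center and scale every feature (densifies the row)."""
    return intercept + sum(
        c * (x - m) / s for x, c, m, s in zip(x_dense, coef, mean, std)
    )

def margin_offset(coef, intercept, mean, std):
    """Fold the centering term into the intercept; no row data is needed."""
    return intercept - sum(c * m / s for c, m, s in zip(coef, mean, std))

def margin_sparse_virtual(x_sparse, coef, std, offset):
    """Iterate over nonzero entries only; x_sparse maps index -> value."""
    return offset + sum(coef[j] * v / std[j] for j, v in x_sparse.items())

# Illustrative numbers (invented for this sketch).
mean = [2.0, 0.5, 10.0]
std = [1.0, 0.25, 4.0]
coef = [0.3, -1.2, 0.7]
intercept = 0.1

x_sparse = {0: 3.0, 2: 8.0}   # feature 1 is an implicit zero
x_dense = [3.0, 0.0, 8.0]

offset = margin_offset(coef, intercept, mean, std)
dense_margin = margin_dense_centered(x_dense, coef, intercept, mean, std)
sparse_margin = margin_sparse_virtual(x_sparse, coef, std, offset)

assert math.isclose(dense_margin, sparse_margin, abs_tol=1e-12)
```

Under this reading, the three conditions are what let the intercept absorb the offset freely: there must be an intercept to shift, and a penalty or bound on it would make the shifted problem differ from the centered one.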

We will also need to check whether this new method works as expected in all other linear models (e.g. mlor/svc/lir/aft), and introduce it into those models where possible.

People

Assignee: Unassigned

Reporter: Ruifeng Zheng

Votes: 0

Watchers: 1

Dates

Created: 17/Mar/21 01:09

Updated: 07/Jun/21 03:29

Resolved: 07/Jun/21 03:29