Spark / SPARK-30641

Project Matrix: Linear Models revisit and refactor


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 3.1.0, 3.2.0
    • Fix Version/s: 3.4.0
    • Component/s: ML, PySpark
    • Labels: None

    Description

      We have been refactoring the linear models for a long time, and some work still remains. After discussions among huaxingao srowen weichenxu123 mengxr podongfeng, we decided to gather the related work under a sub-project, Matrix, which includes:

      1. Blockification (stacking vectors into matrices)
        • Vectors are stacked into matrices so that high-level BLAS routines can be used for better performance (about 3x faster on sparse datasets and up to ~18x faster on dense datasets; see SPARK-31783 for details).
        • Since 3.1.1, LoR/SVC/LiR/AFT support blockification; KMeans still needs to be blockified.
      2. Standardization (virtual centering)
        • The existing implementation of standardization in linear models does NOT center the vectors by subtracting the means, in order to preserve dataset sparsity. However, this scales features with small variance to large values, which underlying solvers like LBFGS cannot handle efficiently; see SPARK-34448 for details.
        • If the internal vectors are centered (as in the well-known GLMNET), convergence improves: in the case in SPARK-34448, the number of iterations to convergence drops from 93 to 6, and the final solution is much closer to GLMNET's.
        • Fortunately, we found a way to virtually center the vectors without densifying the dataset. Good results have been observed in LoR, and we will apply the technique to the other linear models.
      3. Initialization (To be discussed)
        • Initializing the model coefficients from a given model should benefit: 1) convergence (fewer iterations should be needed); 2) model stability (the new solution may stay closer to the previous one).
      4. Early Stopping (To be discussed)
        • We can compute the test error during training (as the tree models do) and stop once the test error begins to increase.
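      To make item 1 concrete, here is a minimal NumPy sketch (illustrative only, not Spark internals): stacking instance vectors into a block turns many per-row dot products into a single BLAS matrix call that yields identical margins.

```python
import numpy as np

# Illustrative sketch (not Spark code): blockification replaces one
# level-1 BLAS dot call per instance vector with a single matrix-vector
# (GEMV) or matrix-matrix (GEMM) call over a stacked block of instances.
rng = np.random.default_rng(0)
coef = rng.normal(size=10)
vectors = [rng.normal(size=10) for _ in range(4)]

# Row-at-a-time: one dot product per instance.
margins_per_row = [v @ coef for v in vectors]

# Blockified: stack the 4 vectors into a 4x10 block, one GEMV call.
block = np.vstack(vectors)
margins_block = block @ coef
```

      The two results are numerically identical; the speedup comes purely from letting the BLAS library process the whole block in one optimized call.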
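      The identity behind item 2's virtual centering can be sketched in a few lines of plain Python (illustrative, not Spark's aggregator code): since (x - mu)/sigma . w = x . (w/sigma) - (mu/sigma) . w, the second term is a constant that can be precomputed once per iteration, so only the nonzero entries of x are ever touched.

```python
# Illustrative sketch of virtual centering (not Spark's aggregator code).
# The margin on centered+scaled features equals a sparse dot product with
# a rescaled coefficient vector minus a precomputed constant offset, so
# the sparse instance vector x never has to be densified.
mu = [0.5, -1.0, 2.0, 0.0, 1.5]        # feature means
sigma = [1.0, 2.0, 0.5, 1.0, 4.0]      # feature standard deviations
w = [1.0, -2.0, 0.5, 3.0, -1.0]        # model coefficients
x = {1: 3.0, 4: 2.0}                   # sparse instance: index -> value

# Naive: densify, center, scale, then take the dot product.
dense_margin = sum(
    w[j] * (x.get(j, 0.0) - mu[j]) / sigma[j] for j in range(len(w))
)

# Virtual centering: precompute once per iteration, touch only nonzeros.
w_scaled = [wj / sj for wj, sj in zip(w, sigma)]
offset = sum(wj * mj / sj for wj, mj, sj in zip(w, mu, sigma))
sparse_margin = sum(v * w_scaled[j] for j, v in x.items()) - offset
```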
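      Item 3's intuition can be demonstrated with a toy gradient-descent fit (an illustrative sketch, not a Spark API; a real solver would apply the same idea when seeding LBFGS): starting from a previous solution shortens the path to convergence.

```python
import numpy as np

# Illustrative warm-start sketch (not a Spark API): fitting least squares
# by gradient descent from a previous coefficient vector converges in
# fewer iterations than starting from zeros.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def fit(X, y, w0, lr=0.01, tol=1e-8, max_iter=10_000):
    """Plain gradient descent; returns (solution, iterations used)."""
    w = w0.copy()
    for it in range(max_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        if np.linalg.norm(lr * grad) < tol:   # step size below tolerance
            return w, it + 1
    return w, max_iter

# Cold start from zeros vs. warm start near a previous solution.
w_cold, iters_cold = fit(X, y, np.zeros(3))
w_warm, iters_warm = fit(X, y, w_true + 0.01)
```

      Both runs reach the same solution; only the iteration count differs, which is the convergence benefit the item describes.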
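      Item 4 could look like the following sketch (illustrative; the `patience` parameter and the validation split are assumptions for the example, not existing Spark ML parameters):

```python
import numpy as np

# Illustrative early-stopping sketch (not an existing Spark ML API):
# monitor error on a held-out set every iteration and stop once it has
# failed to improve for `patience` consecutive iterations.
rng = np.random.default_rng(3)
w_true = np.array([2.0, -1.0])
X_train = rng.normal(size=(80, 2))
X_test = rng.normal(size=(20, 2))
y_train = X_train @ w_true + rng.normal(scale=0.5, size=80)
y_test = X_test @ w_true + rng.normal(scale=0.5, size=20)

w = np.zeros(2)
best_err = np.mean((X_test @ w - y_test) ** 2)   # error of the zero model
patience, bad_rounds = 3, 0
for it in range(1000):
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.05 * grad
    test_err = np.mean((X_test @ w - y_test) ** 2)
    if test_err < best_err:
        best_err, bad_rounds = test_err, 0
    else:
        bad_rounds += 1
        if bad_rounds >= patience:   # test error stopped improving
            break
```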

       

            If you want to add other features to these models, please comment on the ticket.

      Sub-tasks

        1. LinearSVC blockify input vectors (Sub-task, Resolved, Ruifeng Zheng)
        2. LogisticRegression blockify input vectors (Sub-task, Resolved, Ruifeng Zheng)
        3. LinearRegression blockify input vectors (Sub-task, Resolved, Ruifeng Zheng)
        4. KMeans blockify input vectors (Sub-task, Resolved, Apache Spark)
        5. ALS/MLP extend HasBlockSize (Sub-task, Resolved, Huaxin Gao)
        6. GMM blockify input vectors (Sub-task, Resolved, Ruifeng Zheng)
        7. AFT blockify input vectors (Sub-task, Resolved, Ruifeng Zheng)
        8. Document usage of blockSize (Sub-task, Resolved, Ruifeng Zheng)
        9. Performance test on java vectorization vs dot vs gemv vs gemm (Sub-task, Resolved, Ruifeng Zheng)
        10. Performance test on dense and sparse datasets (Sub-task, Resolved, Ruifeng Zheng)
        11. use MemoryUsage to control the size of block (Sub-task, Resolved, Ruifeng Zheng)
        12. Huber loss Convergence (Sub-task, Resolved, Unassigned)
        13. potential regression if use memoryUsage instead of numRows (Sub-task, Resolved, Unassigned)
        14. adaptively blockify instances (Sub-task, Resolved, Ruifeng Zheng)
        15. Linear Models standardization optimization (Sub-task, Resolved, Unassigned)
        16. Refactor Logistic Aggregator - support virtual centering (Sub-task, Resolved, Ruifeng Zheng)
        17. Binary Logistic Regression with intercept support centering (Sub-task, Resolved, Ruifeng Zheng)
        18. Refactor AFT - support virtual centering (Sub-task, Resolved, Ruifeng Zheng)
        19. Multinomial Logistic Regression with intercept support centering (Sub-task, Resolved, Ruifeng Zheng)
        20. Refactor LinearSVC - support virtual centering (Sub-task, Resolved, Ruifeng Zheng)
        21. Refactor LinearRegression - make huber support virtual centering (Sub-task, Resolved, Ruifeng Zheng)
        22. add new gemv to skip array shape checking (Sub-task, Resolved, Ruifeng Zheng)
        23. add a common softmax function (Sub-task, Resolved, Ruifeng Zheng)
        24. optimize sparse GEMM by skipping bound checking (Sub-task, Resolved, Ruifeng Zheng)
        25. ml.optim.aggregator avoid re-allocating buffers (Sub-task, Resolved, Ruifeng Zheng)


          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 3
