Description
We have been refactoring the linear models for a long time, and there is still more work to do. After some discussion among Huaxin Gao, Sean R. Owen, Weichen Xu, Xiangrui Meng, and Ruifeng Zheng, we decided to gather the related work under a sub-project, Matrix, which includes:
- Blockification (vectorization of vectors)
  - Vectors are stacked into matrices so that higher-level BLAS routines can be used for better performance (~3x faster on sparse datasets, up to ~18x faster on dense datasets; see SPARK-31783 for details). A usage sketch follows this item.
  - Since 3.1.1, LoR/SVC/LiR/AFT support blockification; we still need to blockify KMeans in the future.
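A minimal sketch of enabling blockification from the user side, assuming Spark 3.1.1 or later where these estimators expose the `maxBlockSizeInMB` param (0.0 lets the algorithm choose a block size automatically; the toy dataset and the 1 MB setting below are purely illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object BlockifySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("blockify").getOrCreate()
    import spark.implicits._

    // Toy training data; in practice this would be a large DataFrame.
    val train = Seq(
      (0.0, Vectors.dense(0.0, 1.1, 0.1)),
      (1.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    val lor = new LogisticRegression()
      .setMaxIter(10)
      .setMaxBlockSizeInMB(1.0) // stack input vectors into ~1 MB matrix blocks

    val model = lor.fit(train)
    println(model.coefficients)
    spark.stop()
  }
}
```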
- Standardization (virtual centering)
  - The existing implementation of standardization in linear models does NOT center the vectors by removing the means, in order to preserve dataset sparsity. However, this causes features with small variance to be scaled to large values, which underlying solvers like LBFGS cannot handle efficiently; see SPARK-34448 for details.
  - If the internal vectors are centered (as in the well-known GLMNET), convergence improves. In the case in SPARK-34448, the number of iterations to convergence drops from 93 to 6, and the final solution is much closer to the one produced by GLMNET.
  - Luckily, we found a way to virtually center the vectors without densifying the dataset (see the sketch below). Good results have been observed in LoR, and we will apply it to the other linear models.
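The algebra behind virtual centering can be sketched in a few lines: the margin against a centered-and-scaled row, w·((x − mu)/sigma), equals a sparse dot product with rescaled coefficients minus a constant offset, so the means never have to be subtracted element-wise. A self-contained illustration in plain Scala (toy numbers; not the actual internal implementation):

```scala
// Sketch of "virtual centering": compute w·((x - mu)/sigma) without
// materializing the dense centered vector.
object VirtualCenteringSketch {
  def main(args: Array[String]): Unit = {
    val x     = Array(0.0, 3.0, 0.0, 5.0)   // a (conceptually sparse) row
    val mu    = Array(1.0, 2.0, 0.5, 4.0)   // feature means
    val sigma = Array(2.0, 1.0, 0.5, 2.0)   // feature std devs
    val w     = Array(0.3, -0.2, 0.1, 0.4)  // coefficients

    // Naive margin: densify, center, scale, then dot.
    val naive = x.indices.map(j => w(j) * (x(j) - mu(j)) / sigma(j)).sum

    // Virtual centering: fold the scaling into w and subtract a
    // precomputed scalar offset; only nonzero x(j) contribute to the sum.
    val scaledW = x.indices.map(j => w(j) / sigma(j))
    val offset  = x.indices.map(j => scaledW(j) * mu(j)).sum
    val virt    = x.indices.filter(j => x(j) != 0.0)
                           .map(j => scaledW(j) * x(j)).sum - offset

    assert(math.abs(naive - virt) < 1e-12)
    println(s"naive=$naive virtual=$virt")
  }
}
```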
- Initialization (To be discussed)
  - Initializing the model coefficients from a given model should benefit: 1) convergence rate (it should reduce the number of iterations); 2) model stability (the new solution may stay closer to the previous one). A sketch of the effect follows this item.
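There is no public warm-start API in Spark ML today, so the sketch below illustrates the expected effect with plain gradient descent on a one-dimensional least-squares objective: starting from a previous solution needs far fewer iterations than starting from zero. All names and numbers are illustrative:

```scala
// Sketch: why warm-starting reduces iterations. Minimize f(w) = (w - 5)^2
// with fixed-step gradient descent and count steps until convergence.
object WarmStartSketch {
  def iterations(w0: Double): Int = {
    var w = w0
    var n = 0
    while (math.abs(2.0 * (w - 5.0)) > 1e-6) { // gradient of (w - 5)^2
      w -= 0.1 * 2.0 * (w - 5.0)
      n += 1
    }
    n
  }

  def main(args: Array[String]): Unit = {
    println(s"cold start from 0.0: ${iterations(0.0)} iterations")
    println(s"warm start from 4.9: ${iterations(4.9)} iterations")
  }
}
```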
- Early Stopping (To be discussed)
  - We can compute the test error during training (as the tree models do) and stop the training procedure once the test error begins to increase; a sketch of the stopping rule follows.
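A minimal sketch of the proposed stopping rule, again in plain Scala: track the test error once per iteration and stop after it has failed to improve for `patience` consecutive checks. The error sequence here is fabricated to show the mechanics; in the real training loop it would come from evaluating the partially trained model:

```scala
// Sketch: stop training once the test error starts to rise.
object EarlyStoppingSketch {
  def main(args: Array[String]): Unit = {
    // Fabricated per-iteration test errors: improve, then overfit.
    val testErrors = Seq(0.50, 0.40, 0.33, 0.30, 0.29, 0.31, 0.34, 0.38, 0.45)
    val patience   = 2 // stop after this many non-improving checks in a row

    var best    = Double.MaxValue
    var badRuns = 0
    var stopped = -1
    val it = testErrors.iterator.zipWithIndex
    while (it.hasNext && stopped < 0) {
      val (err, i) = it.next()
      if (err < best) { best = err; badRuns = 0 }
      else badRuns += 1
      if (badRuns >= patience) stopped = i
    }
    println(s"stopped at iteration $stopped with best test error $best")
  }
}
```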
If you want to add other features to these models, please comment on the ticket.