Spark / SPARK-30641

Project Matrix: Linear Models revisit and refactor


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 3.1.0, 3.2.0
    • Fix Version/s: 3.4.0
    • Component/s: ML, PySpark
    • Labels: None

    Description

      We have been refactoring the linear models for a long time, and some work still remains. After discussions among Huaxin Gao, Sean R. Owen, Weichen Xu, Xiangrui Meng, and Ruifeng Zheng, we decided to gather the related work under a sub-project, Matrix, which includes:

      1. Blockification (vectorization of vectors)
        • Instance vectors are stacked into matrices so that high-level BLAS routines can be used for better performance (about ~3x faster on sparse datasets and up to ~18x faster on dense datasets; see SPARK-31783 for details).
        • Since 3.1.1, LoR/SVC/LiR/AFT support blockification; we still need to blockify KMeans in the future.
      2. Standardization (virtual centering)
        • The existing implementation of standardization in linear models does NOT center the vectors by removing the means, in order to preserve dataset sparsity. However, this causes features with small variance to be scaled to large values, which underlying solvers like LBFGS cannot handle efficiently; see SPARK-34448 for details.
        • If the internal vectors are centered (as in the well-known GLMNET), the convergence rate is better. In the case reported in SPARK-34448, the number of iterations to convergence drops from 93 to 6. Moreover, the final solution is much closer to the one produced by GLMNET.
        • Fortunately, we found a new way to virtually center the vectors without densifying the dataset. Good results have been observed for LoR, and we will apply it to the other linear models.
      3. Initialization (To be discussed)
        • Initializing the model coefficients from a given model should benefit: 1) the convergence rate (it should reduce the number of iterations); 2) model stability (it may yield a new solution closer to the previous one).
      4. Early Stopping (To be discussed)
        • We can compute the test error during training (as tree models do) and stop the training procedure once the test error begins to increase.
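      The blockification idea in item 1 can be sketched outside Spark with plain NumPy; this is only an illustration of the technique, not Spark's actual BLAS kernel. Stacking instance vectors into one matrix lets a single level-2/3 BLAS call replace many per-row dot products:

```python
import numpy as np

# Illustrative sketch (not Spark code): 256 instance vectors are stacked
# into one matrix ("block"), so computing all margins is a single
# matrix-vector product instead of 256 separate dot products.
rng = np.random.default_rng(0)
block = rng.standard_normal((256, 10))  # one block of stacked instances
coef = rng.standard_normal(10)

# Per-instance dot products (what per-row iteration amounts to).
margins_loop = np.array([row @ coef for row in block])

# One BLAS call over the whole block (the gemv/gemm path).
margins_blas = block @ coef

assert np.allclose(margins_loop, margins_blas)
```

      The results are identical; the speedup comes from the BLAS implementation processing the whole block with cache-friendly, vectorized kernels.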
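      The virtual centering in item 2 rests on a simple algebraic identity: ((x - mu) / sigma) . w = x . (w / sigma) - mu . (w / sigma), so the centered dot product can be computed by touching only the non-zeros of x plus one precomputed scalar offset. A minimal NumPy sketch of the identity (illustrative names, not Spark's implementation):

```python
import numpy as np

# Illustrative sketch: centering can be applied "virtually" by rescaling
# the coefficients once and subtracting a constant offset, so a sparse x
# never has to be densified.
rng = np.random.default_rng(1)
x = rng.standard_normal(8)
mu = np.full(8, 0.5)     # pretend per-feature means
sigma = np.full(8, 2.0)  # pretend per-feature std-devs
w = rng.standard_normal(8)

# Explicit centering (would densify a sparse x).
dense = ((x - mu) / sigma) @ w

# Virtual centering: scale w once, precompute a per-model scalar offset.
w_scaled = w / sigma
offset = mu @ w_scaled   # constant, independent of x
virtual = x @ w_scaled - offset

assert np.allclose(dense, virtual)
```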
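      The expected benefit of item 3 can be illustrated with a toy warm start; everything here is a hypothetical sketch, not a proposed API. Gradient descent on a least-squares objective started from a previous model's coefficients typically needs far fewer iterations than starting from zero:

```python
import numpy as np

# Illustrative sketch: warm-starting an iterative solver from coefficients
# close to the optimum converges in fewer iterations than a cold start.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w

def iterations_to_converge(w0, lr=0.1, tol=1e-6, max_iter=5000):
    """Plain gradient descent; returns the iteration count at convergence."""
    w = w0.copy()
    for it in range(max_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        if np.linalg.norm(grad) < tol:
            return it
    return max_iter

cold = iterations_to_converge(np.zeros(5))
warm = iterations_to_converge(true_w + 0.01 * rng.standard_normal(5))
assert warm < cold  # the warm start converges in fewer iterations
```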
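      The early-stopping loop in item 4 can be sketched as follows; the patience threshold and names are illustrative assumptions, not a proposed Spark API. The idea is to track held-out error each iteration and keep the coefficients from the best iteration:

```python
import numpy as np

# Illustrative sketch: monitor validation error during training and stop
# once it stops improving, keeping the best coefficients seen so far.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
y = X @ np.ones(5) + 0.5 * rng.standard_normal(200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

w = np.zeros(5)
best_err, best_w, patience = np.inf, w.copy(), 0
for step in range(1000):
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.1 * grad
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_err:
        best_err, best_w, patience = val_err, w.copy(), 0
    else:
        patience += 1
        if patience >= 5:  # 5 non-improving steps: stop training
            break
```

      After the loop, best_w holds the coefficients from the iteration with the lowest validation error, which is what the model would return.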

       

            If you want to add other features to these models, please comment on the ticket.

      Sub-tasks

        1. LinearSVC blockify input vectors (Resolved, Ruifeng Zheng)
        2. LogisticRegression blockify input vectors (Resolved, Ruifeng Zheng)
        3. LinearRegression blockify input vectors (Resolved, Ruifeng Zheng)
        4. KMeans blockify input vectors (Resolved, Apache Spark)
        5. ALS/MLP extend HasBlockSize (Resolved, Huaxin Gao)
        6. GMM blockify input vectors (Resolved, Ruifeng Zheng)
        7. AFT blockify input vectors (Resolved, Ruifeng Zheng)
        8. Document usage of blockSize (Resolved, Ruifeng Zheng)
        9. Performance test on java vectorization vs dot vs gemv vs gemm (Resolved, Ruifeng Zheng)
        10. Performance test on dense and sparse datasets (Resolved, Ruifeng Zheng)
        11. use MemoryUsage to control the size of block (Resolved, Ruifeng Zheng)
        12. Huber loss Convergence (Resolved, Unassigned)
        13. potential regression if use memoryUsage instead of numRows (Resolved, Unassigned)
        14. adaptively blockify instances (Resolved, Ruifeng Zheng)
        15. Linear Models standardization optimization (Resolved, Unassigned)
        16. Refactor Logistic Aggregator - support virtual centering (Resolved, Ruifeng Zheng)
        17. Binary Logistic Regression with intercept support centering (Resolved, Ruifeng Zheng)
        18. Refactor AFT - support virtual centering (Resolved, Ruifeng Zheng)
        19. Multinomial Logistic Regression with intercept support centering (Resolved, Ruifeng Zheng)
        20. Refactor LinearSVC - support virtual centering (Resolved, Ruifeng Zheng)
        21. Refactor LinearRegression - make huber support virtual centering (Resolved, Ruifeng Zheng)
        22. add new gemv to skip array shape checking (Resolved, Ruifeng Zheng)
        23. add a common softmax function (Resolved, Ruifeng Zheng)
        24. optimize sparse GEMM by skipping bound checking (Resolved, Ruifeng Zheng)
        25. ml.optim.aggregator avoid re-allocating buffers (Resolved, Ruifeng Zheng)

          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 3

