  1. Spark
  2. SPARK-9647 MLlib + SparkR integration for 1.6
  3. SPARK-9836

Provide R-like summary statistics for ordinary least squares via normal equation solver

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: ML
    • Labels:
      None
    • Target Version/s:

      Description

      In R, model fitting comes with summary statistics. We can provide most of those via the normal equation solver (SPARK-9834). If some statistics require additional passes over the dataset, we can expose an option that lets users select the desired statistics before model fitting.

      > summary(model)
      
      Call:
      glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
      
      Deviance Residuals: 
           Min        1Q    Median        3Q       Max  
      -1.30711  -0.25713  -0.05325   0.19542   1.41253  
      
      Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
      (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
      Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
      Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
      Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
      ---
      Signif. codes:  
      0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      (Dispersion parameter for gaussian family taken to be 0.1918059)
      
          Null deviance: 102.168  on 149  degrees of freedom
      Residual deviance:  28.004  on 146  degrees of freedom
      AIC: 183.94
      
      Number of Fisher Scoring iterations: 2
      

          Activity

          mbaddar Mohamed Baddar added a comment -

          Hello, can I be assigned to this task?
          Thanks.

          mengxr Xiangrui Meng added a comment -

          This JIRA is still blocked by SPARK-10668. If you are a first-time contributor, I would recommend taking a starter task: https://issues.apache.org/jira/issues/?filter=12333209.

          mbaddar Mohamed Baddar added a comment - - edited

          Thanks a lot, Xiangrui Meng. I will try one of the starter tasks, but it seems they are all taken. If so, what should I do next?

          yanboliang Yanbo Liang added a comment -

          I will work on it.

          mengxr Xiangrui Meng added a comment -

          Yanbo Liang Note that the feature freeze deadline for 1.6 is the end of the month. You can check the implementation in https://github.com/AlteryxLabs/sparkGLM and the unit tests should be verified against R lm/glm. cc Chris Freeman and Dan Putler.

          mengxr Xiangrui Meng added a comment -

          Sorry for the late response! There are more starter tasks coming out under SPARK-11337. Are you interested?

          yanboliang Yanbo Liang added a comment -

          OK, I will try to finish it before the end of the month.

          yanboliang Yanbo Liang added a comment - - edited

          Xiangrui Meng After a survey, I found that the coefficient table ("Coefficients: Estimate, Std. Error, t value, Pr(>|t|)") can be obtained from OLS/WLS (by matrix inversion/diagonalization), and that "Deviance Residuals" is a general statistic. I will add these statistics in this task.
          As for the remaining part:

          Null deviance: 102.168 on 149 degrees of freedom
          Residual deviance: 28.004 on 146 degrees of freedom
          AIC: 183.94

          Number of Fisher Scoring iterations: 2

          Some of these statistics depend on IRLS (SPARK-9835). I see you have opened SPARK-9837 to track summary statistics for GLMs via IRLS, so these statistics will be handled in SPARK-9837. Please correct me if I have misunderstood.
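
Editorial sketch (not part of the original comment): for the gaussian family, several of the quantities quoted above can be re-derived from the numbers printed in the R summary in the description. The check below assumes the standard definitions: t value = estimate / standard error, dispersion = residual deviance / residual degrees of freedom, and the gaussian AIC as computed by R's `glm`. All inputs are the rounded values printed by R, so the results agree only up to rounding.

```python
import math

# Values copied from the R summary in the description (rounded as printed).
n = 150                     # observations in iris
residual_deviance = 28.004  # for gaussian this is the residual sum of squares
residual_df = 146           # n minus the 4 fitted coefficients

# name -> (estimate, std. error, t value as printed by R)
coefficients = {
    "(Intercept)": (2.2514, 0.3698, 6.089),
    "Sepal.Width": (0.8036, 0.1063, 7.557),
    "Speciesversicolor": (1.4587, 0.1121, 13.012),
    "Speciesvirginica": (1.9468, 0.1000, 19.465),
}

# t value is the estimate divided by its standard error.
t_values = {name: est / se for name, (est, se, _) in coefficients.items()}

# Dispersion (estimated error variance) for the gaussian family.
dispersion = residual_deviance / residual_df  # about 0.1918

# Gaussian AIC: -2 * log-likelihood + 2 * (number of parameters),
# where the parameters are the coefficients plus the error variance.
num_params = (n - residual_df) + 1
aic = n * (math.log(2 * math.pi * residual_deviance / n) + 1) + 2 * num_params  # about 183.94

for name, (_, _, printed) in coefficients.items():
    print(f"{name:18s} t = {t_values[name]:.3f} (R printed {printed})")
print(f"dispersion = {dispersion:.5f}, AIC = {aic:.2f}")
```

For a gaussian model fitted in one pass, these quantities need only the residual sum of squares, the sample size, and the number of coefficients.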

          mengxr Xiangrui Meng added a comment - - edited

          Yes, this JIRA is only for the normal equation solver and linear regression. We don't need to add all statistics in a single PR. Let's add statistics that can be easily derived from `diag(A^T W A)` and the residuals.
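
Editorial sketch (not part of the original comment, and not Spark's implementation): the idea of deriving coefficient statistics from the normal equations and the residuals can be illustrated in plain Python. The function names `invert` and `wls_summary` are hypothetical. The sketch assumes the textbook formula in which standard errors come from the diagonal of the inverse of A^T W A, scaled by the estimated error variance. p-values are omitted because they need a Student t CDF, which the Python standard library does not provide.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    # Augment m with the identity matrix: [m | I].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def wls_summary(rows, y, w=None):
    """Fit weighted least squares via the normal equations.

    rows: design matrix A as a list of rows (include a 1.0 column for the intercept).
    Returns (coefficients, standard errors, t values, residuals).
    """
    n, p = len(rows), len(rows[0])
    w = w or [1.0] * n
    # Gram matrix A^T W A and moment vector A^T W y.
    ata = [[sum(w[i] * rows[i][j] * rows[i][k] for i in range(n))
            for k in range(p)] for j in range(p)]
    aty = [sum(w[i] * rows[i][j] * y[i] for i in range(n)) for j in range(p)]
    inv = invert(ata)
    beta = [sum(inv[j][k] * aty[k] for k in range(p)) for j in range(p)]
    resid = [y[i] - sum(rows[i][j] * beta[j] for j in range(p)) for i in range(n)]
    # Estimated error variance: weighted residual sum of squares over n - p.
    sigma2 = sum(w[i] * resid[i] ** 2 for i in range(n)) / (n - p)
    se = [math.sqrt(sigma2 * inv[j][j]) for j in range(p)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t, resid

# Toy data (made up for illustration): intercept column plus one feature.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 8.0]
beta, se, t, resid = wls_summary(A, y)
print("coefficients:", beta)
print("std. errors: ", se)
print("t values:    ", t)
```

On the toy data the fit is intercept 0.8 and slope 2.3. In a distributed setting the Gram matrix and moment vector are the only parts that need a pass over the data, which is why these statistics are cheap to expose alongside the normal equation solver.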

          apachespark Apache Spark added a comment -

          User 'yanboliang' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9413

          mengxr Xiangrui Meng added a comment -

          Issue resolved by pull request 9413
          https://github.com/apache/spark/pull/9413


            People

            • Assignee: yanboliang Yanbo Liang
            • Reporter: mengxr Xiangrui Meng
            • Shepherd: Xiangrui Meng
            • Votes: 2
            • Watchers: 8
