
SPARK-13073: Creating an R-like summary for logistic regression in Spark (Scala)

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Component/s: ML, MLlib

    Description

      Currently Spark ML provides only the coefficients for a fitted logistic regression model. To evaluate the trained model, tests such as the Wald test and the chi-square test should be performed, and their results summarized and displayed like R's GLM summary.
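      For reference, a minimal sketch (Scala, public Spark ML API) of what is available today after fitting: only the point estimates, with no standard errors, test statistics, or p-values as in R's summary(glm(...)). The "training" DataFrame is a hypothetical dataset with "label" and "features" columns.

      import org.apache.spark.ml.classification.LogisticRegression

      // "training" is a hypothetical DataFrame with "label" and "features" columns.
      val lr = new LogisticRegression().setMaxIter(100)
      val model = lr.fit(training)

      // Only the point estimates are exposed; no Wald or chi-square statistics.
      println(s"Coefficients: ${model.coefficients}")
      println(s"Intercept: ${model.intercept}")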

    Activity

            josephkb Joseph K. Bradley added a comment -

            It sounds reasonable to provide the same printed summary in Scala, Java, and Python as in R. Perhaps it can be provided as a toString method for the LogisticRegressionModel.summary member?
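            A minimal sketch of that idea (illustrative only, not actual Spark source): the summary object overrides toString so that println(model.summary) prints an R-like block. The class and field names below are placeholders.

            // Illustrative sketch; names and layout are assumptions, not the
            // real LogisticRegressionSummary API.
            class SummarySketch(
                featureNames: Array[String],
                coefficients: Array[Double],
                areaUnderROC: Double) {

              override def toString: String = {
                val header = f"${""}%-12s${"Estimate"}%12s"
                val rows = featureNames.zip(coefficients).map {
                  case (name, coef) => f"$name%-12s$coef%12.6f"
                }
                (Seq("Coefficients:", header) ++ rows :+
                  f"Area under ROC: $areaUnderROC%.4f").mkString("\n")
              }
            }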
            mbaddar1 Mohamed Baddar added a comment -

            josephkb Can you assign this to me as a starter task?

            mbaddar1 Mohamed Baddar added a comment -

            josephkb After looking at the source code of org.apache.spark.ml.classification.LogisticRegressionSummary and org.apache.spark.ml.classification.LogisticRegressionTrainingSummary, and after running a sample GLM in R, which produced the following output:

            Call:
            glm(formula = mpg ~ wt + hp + gear, family = gaussian(), data = mtcars)

            Deviance Residuals:
            Min 1Q Median 3Q Max
            -3.3712 -1.9017 -0.3444 0.9883 6.0655

            Coefficients:
            Estimate Std. Error t value Pr(>|t|)
            (Intercept) 32.013657 4.632264 6.911 1.64e-07 ***
            wt -3.197811 0.846546 -3.777 0.000761 ***
            hp -0.036786 0.009891 -3.719 0.000888 ***
            gear 1.019981 0.851408 1.198 0.240963

            ---
            Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            (Dispersion parameter for gaussian family taken to be 6.626347)

            Null deviance: 1126.05 on 31 degrees of freedom
            Residual deviance: 185.54 on 28 degrees of freedom
            AIC: 157.05

            Number of Fisher Scoring iterations: 2

            I have the following comments:
            1. I think we should add the following members to LogisticRegressionSummary: coefficients and residuals.

            2. The toString method should be overridden in the following classes:
            org.apache.spark.ml.classification.BinaryLogisticRegressionSummary and org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary

            Any other suggestions? Please correct me if I have missed something.
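            A rough sketch of suggestion 1 above (hypothetical, not actual Spark source): the new members on the summary, typed along the lines of the existing ML API. How the residuals would be computed is left out.

            import org.apache.spark.ml.linalg.Vector
            import org.apache.spark.sql.DataFrame

            // Hypothetical additions to the summary; names and types are assumptions.
            trait SummaryWithCoefficients {
              /** Fitted coefficients, in feature order. */
              def coefficients: Vector
              /** Per-row residuals (e.g. deviance residuals, as printed by R). */
              def residuals: DataFrame
            }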

            mbaddar1 Mohamed Baddar added a comment - - edited

            josephkb After more investigation of the code, and in order to keep the changes minimal, my previous suggestion may not be suitable. I think we can implement a toString version for BinaryLogisticRegressionSummary that gives different information than the R summary. It will create a string representation of the following members:
            precision
            recall
            fMeasure
            Is there any comment before I start the PR?
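            A hedged sketch of what such a toString could report, based on members that BinaryLogisticRegressionSummary already exposes (areaUnderROC plus the by-threshold precision/recall/F-measure DataFrames); the exact wording and choice of metrics are open.

            import org.apache.spark.ml.classification.BinaryLogisticRegressionSummary
            import org.apache.spark.sql.functions.max

            // Sketch only; in the 1.6/2.0-era API the binary summary is obtained by
            // casting model.summary to BinaryLogisticRegressionSummary.
            def summaryString(s: BinaryLogisticRegressionSummary): String = {
              // fMeasureByThreshold has columns (threshold, F-Measure)
              val bestF = s.fMeasureByThreshold.agg(max("F-Measure")).head().getDouble(0)
              f"""Binary logistic regression summary
                 |Area under ROC : ${s.areaUnderROC}%.4f
                 |Best F-measure : $bestF%.4f (maximised over thresholds)""".stripMargin
            }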

            apachespark Apache Spark added a comment -

            User 'mbaddar1' has created a pull request for this issue:
            https://github.com/apache/spark/pull/11729

            mbaddar1 Mohamed Baddar added a comment -

            josephkb Can any one of the admins verify this PR?

            samsudhin Samsudhin added a comment -

            @Mohamed Baddar I checked your comment of 10/Mar/16 13:28.

            You executed a linear regression summary (gaussian family). For logistic regression, the summary would look like the one below:

            > summary(glm(formula = vs ~ wt + hp + gear, family = binomial(), data = mtcars))

            Call:
            glm(formula = vs ~ wt + hp + gear, family = binomial(), data = mtcars)

            Deviance Residuals:
            Min 1Q Median 3Q Max
            -1.79167 -0.19535 -0.00689 0.43289 1.54872

            Coefficients:
            Estimate Std. Error z value Pr(>|z|)
            (Intercept) 11.17572 9.26728 1.206 0.2278
            wt 0.55553 1.58811 0.350 0.7265
            hp -0.08514 0.03618 -2.353 0.0186 *
            gear -0.64723 1.42248 -0.455 0.6491

            ---
            Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            (Dispersion parameter for binomial family taken to be 1)

            Null deviance: 43.86 on 31 degrees of freedom
            Residual deviance: 15.89 on 28 degrees of freedom
            AIC: 23.89

            Number of Fisher Scoring iterations: 7

            mbaddar1 Mohamed Baddar added a comment -

            Thanks samsudhin, I noticed the difference in parameters. Do you have any other comments on my notes?


            josephkb Joseph K. Bradley added a comment -

            Sorry for the slow response. I think this approach is fine, where we match the summary format of R and provide whatever info is available. I'll comment on the PR.
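            Matching R's layout could look like the sketch below. It assumes standard errors and p-values are available from the fitting step, which today they are not; every input here is a hypothetical placeholder rather than an existing Spark API.

            // Hypothetical helper; none of these inputs exist on the current summary classes.
            def coefficientTable(
                names: Seq[String],
                estimates: Seq[Double],
                stdErrors: Seq[Double],
                pValues: Seq[Double]): String = {

              // R-style significance stars for a p-value.
              def stars(p: Double): String =
                if (p < 0.001) "***" else if (p < 0.01) "**"
                else if (p < 0.05) "*" else if (p < 0.1) "." else ""

              val header =
                f"${""}%-12s${"Estimate"}%12s${"Std. Error"}%12s${"z value"}%9s${"Pr(>|z|)"}%10s"
              val rows = names.indices.map { i =>
                val z = estimates(i) / stdErrors(i)
                f"${names(i)}%-12s${estimates(i)}%12.6f${stdErrors(i)}%12.6f$z%9.3f${pValues(i)}%10.4f ${stars(pValues(i))}"
              }
              (header +: rows).mkString("\n") +
                "\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
            }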
            samsudhin Samsudhin added a comment -

            mbaddar1 any update on this?

            mbaddar1 Mohamed Baddar added a comment -

            samsudhin I will work on it soon


            People

              Assignee: Unassigned
              Reporter: Samsudhin
              Votes: 1
              Watchers: 5
