
SPARK-13073: Creating an R-like summary for logistic regression in Spark (Scala)

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Component/s: ML, MLlib

    Description

      Currently Spark ML provides only the coefficients for a fitted logistic regression model. To evaluate the trained model, tests such as the Wald test and the chi-square test should be performed, and their results summarized and displayed like R's GLM summary.
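      For reference, a minimal sketch (Scala, public Spark ML API) of what is available today after fitting: only the point estimates, with no standard errors, test statistics, or p-values as in R's summary(glm(...)). The "training" DataFrame is a hypothetical dataset with "label" and "features" columns.

      import org.apache.spark.ml.classification.LogisticRegression

      // "training" is a hypothetical DataFrame with "label" and "features" columns.
      val lr = new LogisticRegression().setMaxIter(100)
      val model = lr.fit(training)

      // Only the point estimates are exposed; no Wald or chi-square statistics.
      println(s"Coefficients: ${model.coefficients}")
      println(s"Intercept: ${model.intercept}")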

    Activity

            josephkb Joseph K. Bradley added a comment -

            It sounds reasonable to provide the same printed summary in Scala, Java, and Python as in R. Perhaps it can be provided as a toString method for the LogisticRegressionModel.summary member?
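            A minimal sketch of that idea (illustrative only, not actual Spark source): the summary object overrides toString so that println(model.summary) prints an R-like block. The class and field names below are placeholders.

            // Illustrative sketch; names and layout are assumptions, not the
            // real LogisticRegressionSummary API.
            class SummarySketch(
                featureNames: Array[String],
                coefficients: Array[Double],
                areaUnderROC: Double) {

              override def toString: String = {
                val header = f"${""}%-12s${"Estimate"}%12s"
                val rows = featureNames.zip(coefficients).map {
                  case (name, coef) => f"$name%-12s$coef%12.6f"
                }
                (Seq("Coefficients:", header) ++ rows :+
                  f"Area under ROC: $areaUnderROC%.4f").mkString("\n")
              }
            }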
            mbaddar1 Mohamed Baddar added a comment -

            josephkb Can you assign this to me as a starter task?

            mbaddar1 Mohamed Baddar added a comment -

            josephkb After looking at the source code of org.apache.spark.ml.classification.LogisticRegressionSummary and org.apache.spark.ml.classification.LogisticRegressionTrainingSummary, and after running a sample GLM in R, which produced the following output:

            Call:
            glm(formula = mpg ~ wt + hp + gear, family = gaussian(), data = mtcars)

            Deviance Residuals:
            Min 1Q Median 3Q Max
            -3.3712 -1.9017 -0.3444 0.9883 6.0655

            Coefficients:
            Estimate Std. Error t value Pr(>|t|)
            (Intercept) 32.013657 4.632264 6.911 1.64e-07 ***
            wt -3.197811 0.846546 -3.777 0.000761 ***
            hp -0.036786 0.009891 -3.719 0.000888 ***
            gear 1.019981 0.851408 1.198 0.240963

            ---
            Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            (Dispersion parameter for gaussian family taken to be 6.626347)

            Null deviance: 1126.05 on 31 degrees of freedom
            Residual deviance: 185.54 on 28 degrees of freedom
            AIC: 157.05

            Number of Fisher Scoring iterations: 2

            I have the following comments:
            1. I think we should add the following members to LogisticRegressionSummary: coefficients and residuals.

            2. The toString method should be overridden in the following classes:
            org.apache.spark.ml.classification.BinaryLogisticRegressionSummary and org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary

            Any other suggestions? Please correct me if I have missed something.
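            A rough sketch of suggestion 1 above (hypothetical, not actual Spark source): the new members on the summary, typed along the lines of the existing ML API. How the residuals would be computed is left out.

            import org.apache.spark.ml.linalg.Vector
            import org.apache.spark.sql.DataFrame

            // Hypothetical additions to the summary; names and types are assumptions.
            trait SummaryWithCoefficients {
              /** Fitted coefficients, in feature order. */
              def coefficients: Vector
              /** Per-row residuals (e.g. deviance residuals, as printed by R). */
              def residuals: DataFrame
            }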

            mbaddar1 Mohamed Baddar added a comment - - edited

            josephkb After more investigation of the code, and in order to keep the changes minimal, my previous suggestion may not be suitable. I think we can implement a toString version for BinaryLogisticRegressionSummary that gives different information than the R summary. It will create a string representation of the following members:
            precision
            recall
            fMeasure
            Is there any comment before I start the PR?
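            A hedged sketch of what such a toString could report, based on members that BinaryLogisticRegressionSummary already exposes (areaUnderROC plus the by-threshold precision/recall/F-measure DataFrames); the exact wording and choice of metrics are open.

            import org.apache.spark.ml.classification.BinaryLogisticRegressionSummary
            import org.apache.spark.sql.functions.max

            // Sketch only; in the 1.6/2.0-era API the binary summary is obtained by
            // casting model.summary to BinaryLogisticRegressionSummary.
            def summaryString(s: BinaryLogisticRegressionSummary): String = {
              // fMeasureByThreshold has columns (threshold, F-Measure)
              val bestF = s.fMeasureByThreshold.agg(max("F-Measure")).head().getDouble(0)
              f"""Binary logistic regression summary
                 |Area under ROC : ${s.areaUnderROC}%.4f
                 |Best F-measure : $bestF%.4f (maximised over thresholds)""".stripMargin
            }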

            apachespark Apache Spark added a comment -

            User 'mbaddar1' has created a pull request for this issue:
            https://github.com/apache/spark/pull/11729

            mbaddar1 Mohamed Baddar added a comment -

            josephkb Can any one of the admins verify this PR?

            samsudhin Samsudhin added a comment -

            @Mohamed Baddar I checked your comment of 10/Mar/16 13:28.

            You executed a linear regression summary (gaussian family). For logistic regression, the summary would look like the one below:

            > summary(glm(formula = vs ~ wt + hp + gear, family = binomial(), data = mtcars))

            Call:
            glm(formula = vs ~ wt + hp + gear, family = binomial(), data = mtcars)

            Deviance Residuals:
            Min 1Q Median 3Q Max
            -1.79167 -0.19535 -0.00689 0.43289 1.54872

            Coefficients:
            Estimate Std. Error z value Pr(>|z|)
            (Intercept) 11.17572 9.26728 1.206 0.2278
            wt 0.55553 1.58811 0.350 0.7265
            hp -0.08514 0.03618 -2.353 0.0186 *
            gear -0.64723 1.42248 -0.455 0.6491

            ---
            Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            (Dispersion parameter for binomial family taken to be 1)

            Null deviance: 43.86 on 31 degrees of freedom
            Residual deviance: 15.89 on 28 degrees of freedom
            AIC: 23.89

            Number of Fisher Scoring iterations: 7

            mbaddar1 Mohamed Baddar added a comment -

            Thanks samsudhin, I noticed the difference in parameters. Do you have any other comments on my notes?


            josephkb Joseph K. Bradley added a comment -

            Sorry for the slow response. I think this approach is fine, where we match the summary format of R and provide whatever info is available. I'll comment on the PR.
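            Matching R's layout could look like the sketch below. It assumes standard errors and p-values are available from the fitting step, which today they are not; every input here is a hypothetical placeholder rather than an existing Spark API.

            // Hypothetical helper; none of these inputs exist on the current summary classes.
            def coefficientTable(
                names: Seq[String],
                estimates: Seq[Double],
                stdErrors: Seq[Double],
                pValues: Seq[Double]): String = {

              // R-style significance stars for a p-value.
              def stars(p: Double): String =
                if (p < 0.001) "***" else if (p < 0.01) "**"
                else if (p < 0.05) "*" else if (p < 0.1) "." else ""

              val header =
                f"${""}%-12s${"Estimate"}%12s${"Std. Error"}%12s${"z value"}%9s${"Pr(>|z|)"}%10s"
              val rows = names.indices.map { i =>
                val z = estimates(i) / stdErrors(i)
                f"${names(i)}%-12s${estimates(i)}%12.6f${stdErrors(i)}%12.6f$z%9.3f${pValues(i)}%10.4f ${stars(pValues(i))}"
              }
              (header +: rows).mkString("\n") +
                "\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
            }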
            samsudhin Samsudhin added a comment -

            mbaddar1 any update on this?

            mbaddar1 Mohamed Baddar added a comment -

            samsudhin I will work on it soon


            People

              Assignee: Unassigned
              Reporter: Samsudhin
              Votes: 1
              Watchers: 5
