[SPARK-34448] Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.5, 3.0.0
Fix Version/s: 3.2.0
Component/s: ML, MLlib
Labels:
- correctness

Description

I have written up a fairly detailed gist that includes code to reproduce the bug, as well as the output of the code and some commentary:
https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96
To summarize: under certain conditions, the minimization that fits a binary logistic regression pulls the intercept value towards the log-odds of the target data. This is mathematically correct only when the features come from distributions with zero means. In general, it gives incorrect intercept values, and consequently incorrect coefficients as well.
As I am not very familiar with the Spark code base, I have not been able to locate this bug within the Spark code itself. A hint about the bug is here:
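A minimal pure-Python sketch of the effect described in the summary, not the code from the gist and not Spark's implementation. It assumes a single feature drawn from a normal distribution with mean 3, a hypothetical true model with intercept -2 and coefficient 1, and a hand-written Newton solver (`fit_logistic` is an illustrative helper, not a library function). It compares the correctly fitted intercept with the log-odds of the labels, first on the raw feature and then on the centered feature.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iters=25):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by Newton's method (no regularization)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            r = y - p            # residual: gradient of the log-likelihood
            w = p * (1.0 - p)    # weight: entry of the negative Hessian
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Assumed synthetic data: the feature mean is 3, so it is NOT centered.
TRUE_B0, TRUE_B1 = -2.0, 1.0
rng = random.Random(0)
xs = [rng.gauss(3.0, 1.0) for _ in range(5000)]
ys = [1.0 if rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * x) else 0.0 for x in xs]

# Log-odds of the target data.
pos_rate = sum(ys) / len(ys)
log_odds = math.log(pos_rate / (1.0 - pos_rate))

# Fit on the raw feature, then on the centered feature.
b0_raw, b1_raw = fit_logistic(xs, ys)
x_mean = sum(xs) / len(xs)
b0_centered, b1_centered = fit_logistic([x - x_mean for x in xs], ys)

print("log-odds of labels:        ", log_odds)
print("intercept, raw feature:    ", b0_raw)
print("intercept, centered feature:", b0_centered)
```

On the raw feature the fitted intercept should land near the assumed true value of -2, far from the log-odds of the labels, while on the centered feature the two are close. An optimizer that drags the intercept towards the log-odds is therefore only approximately harmless in the centered case.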
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904
Based on the code, I don't believe that the features have zero means at this point, so this heuristic is incorrect. But an incorrect starting point alone does not explain this bug: the minimizer should still drift to the correct place. I was not able to find the code of the actual objective function that is being minimized.
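For illustration, the heuristic in question can be sketched as follows. This is an assumption about what the linked lines do (start with all coefficients at zero and the intercept at the log-odds of the labels), written in plain Python rather than copied from Spark; `initial_intercept` and `intercept_gradient` are hypothetical names. The sketch checks that, with every coefficient held at zero, this starting intercept zeroes the gradient of the log-loss with respect to the intercept, which is what makes it a natural starting point.

```python
import math


def initial_intercept(labels):
    # Assumed heuristic: b = log(P(1) / P(0)), computed from label counts.
    n1 = sum(1 for y in labels if y == 1)
    n0 = len(labels) - n1
    return math.log(n1 / n0)


def intercept_gradient(b0, labels):
    # Gradient of the summed log-loss w.r.t. the intercept when all
    # coefficients are zero, so every row has the same prediction p.
    p = 1.0 / (1.0 + math.exp(-b0))
    return sum(p - y for y in labels)


# Toy labels: 30 positives and 70 negatives.
labels = [1] * 30 + [0] * 70
b0 = initial_intercept(labels)
grad = intercept_gradient(b0, labels)

print("initial intercept:", b0)
print("intercept gradient at the starting point:", grad)
```

Once the coefficients move away from zero, the optimal intercept also depends on the feature means, so this value is only a starting guess unless the features are centered.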

Issue Links

links to

[Github] Pull Request #31657 (zhengruifeng)

[Github] Pull Request #31693 (zhengruifeng)

People

Assignee: Ruifeng Zheng

Reporter: Yakov Kerzhner

Votes: 0

Watchers: 3

Dates

Created: 16/Feb/21 15:12

Updated: 26/Mar/21 17:32

Resolved: 26/Mar/21 17:32