Description
This bug appears when fitting a Logistic Regression model with `.setStandardization(false)` and `.setFitIntercept(false)`. If the data matrix has a column whose entries are all the same value, the resulting model is incorrect. Specifically, that constant column always gets a weight of 0 because of a special check inside the code. However, the correct solution, which is unique for L2-regularized logistic regression, usually assigns it a nonzero weight.
I used the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of ones (available in the PR). The resulting data was run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:
 libsvm
 scikit-learn
 spark.ml
(Note that libsvm and scikit-learn use a slightly different formulation, so their regularization parameter is set to the equivalent value 1/270.)
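The 1/270 figure follows from the two objective forms. As a sketch, assuming spark.ml minimizes the mean log-loss plus (regParam/2)·||w||² when no L1 penalty is used, and that libsvm and scikit-learn minimize ½||w||² plus C times the summed log-loss, the two objectives differ only by a constant factor when C = 1/(n · regParam). The helper name `spark_reg_to_c` is hypothetical and not part of either library.

```python
def spark_reg_to_c(reg_param, n_samples):
    """Convert a Spark-style L2 strength into the C used by libsvm/scikit-learn.

    Assumed objectives (stated in the text above, not read from either code base):
      Spark:        (1/n) * sum(log-loss) + (reg_param / 2) * ||w||^2
      scikit-learn: C * sum(log-loss) + (1/2) * ||w||^2
    Multiplying the Spark objective by 1 / reg_param gives the second form
    with C = 1 / (n * reg_param).
    """
    if reg_param <= 0 or n_samples <= 0:
        raise ValueError("reg_param and n_samples must be positive")
    return 1.0 / (reg_param * n_samples)


# heart_scale has 270 rows, and the runs above use reg=1.0.
print(spark_reg_to_c(1.0, 270))

# A two-sample data set with reg=1.0 maps to C=0.5.
print(spark_reg_to_c(1.0, 2))
```

Under the same assumption, a reg=1.0 run on two samples corresponds to C=0.5.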
The first two reach an objective value of 0.7275 and give the solution vector:
[0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, 0.0006404006623321454, 0.06367837291956932, 0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, 0.01248615347419409].
Spark reaches an objective value of 0.7278 and gives the solution vector:
[0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]
Notice the last element of the weight vector is 0.
An even simpler example is:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two samples; the second column is constant (all ones).
x_train = np.array([[1, 1], [0, 1]])
y_train = np.array([1, 0])

# No intercept, so the constant column has to carry the bias.
model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, fit_intercept=False).fit(x_train, y_train)
print(model.coef_)
# [[ 0.22478867 -0.02241016]]
Training the current Spark solver on the same data gives a different result; see the unit test in the PR.
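The effect of forcing the constant column's weight to zero can be checked on the two-sample example without any library. The following is a minimal sketch using plain gradient descent on the scikit-learn form of the objective (½||w||² plus C times the summed log-loss, no intercept). The `frozen` argument is a hypothetical stand-in for the zero-weight behavior described above, not Spark's actual code path, so the pinned result need not match Spark's output exactly.

```python
import math

# Two-sample data from the example above; the second column is constant.
X = [[1.0, 1.0], [0.0, 1.0]]
y = [1, 0]
C = 0.5


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))


def objective(w):
    """0.5 * ||w||^2 + C * summed log-loss (no intercept)."""
    total = 0.5 * sum(wj * wj for wj in w)
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        total -= C * (math.log(p) if yi == 1 else math.log(1.0 - p))
    return total


def gradient(w):
    grad = list(w)  # gradient of the L2 term is w itself
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        for j, xj in enumerate(xi):
            grad[j] += C * (p - yi) * xj
    return grad


def fit(frozen=(), steps=500, lr=0.5):
    """Plain gradient descent; indices listed in `frozen` are held at zero."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        w = [0.0 if j in frozen else wj - lr * gj
             for j, (wj, gj) in enumerate(zip(w, g))]
    return w


w_free = fit()                # unconstrained minimizer
w_pinned = fit(frozen=(1,))   # constant column's weight forced to 0
print("free:  ", w_free, objective(w_free))
print("pinned:", w_pinned, objective(w_pinned))
```

Because the L2-regularized objective is strictly convex, the unconstrained run converges to the unique minimizer, which puts a nonzero weight on the constant column, and the pinned run ends at a strictly larger objective value.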
Issue Links
 is related to SPARK-13590 Document the behavior of spark.ml logistic regression and AFT survival regression when there are constant features (Resolved)