Spark / SPARK-13029

Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.5.2, 1.6.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

      This bug appears when fitting a Logistic Regression model with `setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has a column whose entries are all identical (a constant column), the resulting model is incorrect. Specifically, that column always gets a weight of 0 because of a special check for such columns inside the code. However, the correct solution, which is unique for L2-regularized logistic regression, usually assigns it a non-zero weight.

      I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), manually augmented with a column of ones (available in the PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:

      • libsvm
      • scikit-learn
      • sparkml

      (Note that libsvm and scikit-learn use a slightly different formulation, so their regularization parameter is set to the equivalent value 1/270.)
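      The 1/270 conversion can be checked numerically. The sketch below assumes the two formulations are the usual ones: a mean-loss objective with penalty (reg/2)*||w||^2 on the Spark side, and a C-scaled summed loss with penalty (1/2)*||w||^2 on the libsvm / scikit-learn side. Under that assumption C = 1/(n*reg), which is 1/270 for n = 270 samples and reg = 1.0. The function names and the random data are placeholders for illustration only, not the heart_scale data.

```python
import numpy as np


def logistic_loss_sum(w, X, y):
    """Summed logistic loss for labels y in {0, 1}, without an intercept."""
    z = X @ w
    # log(1 + exp(z)) - y * z, evaluated stably via logaddexp
    return np.sum(np.logaddexp(0.0, z) - y * z)


def mean_loss_objective(w, X, y, reg):
    # Assumed Spark-style form: mean loss + (reg / 2) * ||w||^2
    return logistic_loss_sum(w, X, y) / len(y) + 0.5 * reg * (w @ w)


def c_scaled_objective(w, X, y, C):
    # Assumed libsvm / scikit-learn-style form: C * summed loss + (1 / 2) * ||w||^2
    return C * logistic_loss_sum(w, X, y) + 0.5 * (w @ w)


n, reg = 270, 1.0
C = 1.0 / (n * reg)  # the 1/270 mentioned above

# Synthetic placeholder data with the same shape as the augmented problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 14))
y = rng.integers(0, 2, size=n).astype(float)
w = rng.normal(size=14)

a = mean_loss_objective(w, X, y, reg)
b = c_scaled_objective(w, X, y, C)
print(a, b)  # the two values agree when reg = 1.0 and C = 1 / n
```

      For a general reg the second objective equals the first divided by reg, so both have the same minimizer.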

      The first two reach an objective value of 0.7275 and give the solution vector:
      [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409].

      Spark produces an objective value of 0.7278 and gives the solution vector:
      [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]

      Notice the last element of the weight vector is 0.
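      To make the discrepancy easier to see, here is a small sketch that compares the two vectors element by element. The numbers are copied from above; the variable names are only for illustration.

```python
# Solution reported for libsvm / scikit-learn (values copied from above).
reference_w = [
    0.03007516959304916, 0.09054186091216457, 0.09540306114820495,
    0.02436266296315414, 0.01739437315700921, -0.0006404006623321454,
    0.06367837291956932, -0.0589096636263823, 0.1382458934368336,
    0.06653302996539669, 0.07988499067852513, 0.1197789052423401,
    0.1801661775839843, -0.01248615347419409,
]

# Solution reported for Spark (values copied from above).
spark_w = [
    0.029917351003921247, 0.08993936770232434, 0.09458507615360119,
    0.024920710363734895, 0.018259589234194296, 5.929247527202199E-4,
    0.06362198973221662, -0.059307008587031494, 0.13886738997128056,
    0.0678246717525043, 0.08062880450385658, 0.12084979858539521,
    0.180460850026883, 0.0,
]

# Absolute difference per coordinate and the index where it is largest.
diffs = [abs(a - b) for a, b in zip(reference_w, spark_w)]
worst = max(range(len(diffs)), key=diffs.__getitem__)

print("largest gap at index", worst, "with |diff| =", diffs[worst])
print("reference weight:", reference_w[worst], "Spark weight:", spark_w[worst])
```

      The largest gap falls on the last coordinate, which is the added column of ones.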

      An even simpler example is:

      benchmark.py
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # Two samples; the second column is constant (all ones).
      x_train = np.array([[1, 1], [0, 1]])
      y_train = np.array([1, 0])
      # No intercept, so the constant column has to carry any offset itself.
      model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, fit_intercept=False).fit(x_train, y_train)
      print(model.coef_)

      [[ 0.22478867 -0.02241016]]
      
      

      Training the current Spark solver on the same data gives a different result; see the unit test in the PR.
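      As an independent cross-check of the two-sample example, the sketch below solves the same problem with a plain Newton iteration in NumPy. It assumes the mean-loss objective with penalty (reg/2)*||w||^2 and reg=1.0. The "pinned" fit, which forces the constant column's weight to 0, only mimics the behaviour described above; it is not Spark's actual code, and the function names are hypothetical.

```python
import numpy as np


def objective(w, X, y, reg):
    """Mean logistic loss (labels in {0, 1}, no intercept) plus (reg / 2) * ||w||^2."""
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * reg * (w @ w)


def newton_fit(X, y, reg, iters=50):
    """Plain Newton iteration without line search; adequate for this tiny problem."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        grad = X.T @ (p - y) / n + reg * w
        hess = (X.T * (p * (1.0 - p))) @ X / n + reg * np.eye(d)
        w = w - np.linalg.solve(hess, grad)
    return w


X = np.array([[1.0, 1.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
reg = 1.0

# Unconstrained fit: both weights are free.
w_full = newton_fit(X, y, reg)

# Pinned fit: drop the constant column, then report its weight as 0.
w_pinned = np.zeros(2)
w_pinned[:1] = newton_fit(X[:, :1], y, reg)

print("free weights:  ", w_full, "objective:", objective(w_full, X, y, reg))
print("pinned weights:", w_pinned, "objective:", objective(w_pinned, X, y, reg))
```

      The free fit should agree with the scikit-learn coefficients printed above, and its objective value is lower than that of the pinned fit, so forcing the weight to 0 cannot be optimal for this data.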

              People

              • Assignee:
                coderxiang Shuo Xiang
                Reporter:
                coderxiang Shuo Xiang
                Shepherd:
                Xiangrui Meng
              • Votes:
                0
                Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved: