SPARK-6705

MLLIB ML Pipeline's Logistic Regression has no intercept term

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: ML, MLlib
    • Labels: None

    Description

      Currently, the ML Pipeline's LogisticRegression.scala file does not allow setting whether or not to fit an intercept term. Therefore, the pipeline defers to LogisticRegressionWithLBFGS which does not use an intercept term. This makes sense from a performance point of view because adding an intercept term requires memory allocation.

      However, this is statistically undesirable: the usual default is to include an intercept term, and one needs a very strong reason to omit it.

      Explicitly modeling the intercept by adding a column of all 1s does not work, because LogisticRegressionWithLBFGS forces column normalization. A column of all 1s has zero variance, so scaling it by its standard deviation means dividing by 0, which wipes the column out.
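
To make the zero-variance problem concrete, here is a minimal sketch in plain Python. It is not Spark's code: `column_std` and `scale_by_std` are hypothetical helpers, and mapping a zero-variance column to all zeros is an assumption about how a scaler might guard the division.

```python
import math

def column_std(values):
    """Population standard deviation of one feature column."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def scale_by_std(values):
    """Divide a column by its standard deviation.

    Assumption: a zero-variance column is mapped to all zeros instead of
    raising, which is one way a scaler can guard the division by 0.
    """
    std = column_std(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [v / std for v in values]

ones = [1.0, 1.0, 1.0, 1.0]      # hand-made intercept column
feature = [0.0, 2.0, 4.0, 6.0]   # an ordinary feature column

scaled_ones = scale_by_std(ones)        # the intercept column is lost
scaled_feature = scale_by_std(feature)  # rescaled to unit variance
```

After scaling, the all-1s column carries no information, so the optimizer cannot use it as an intercept.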

      We should open up the API for the ML Pipeline to explicitly allow controlling whether or not to fit an intercept.
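
As an illustration of what such a switch controls, the sketch below fits a one-feature logistic regression with plain batch gradient descent. It is a toy under stated assumptions, not the Spark implementation: `fit_logistic`, its `fit_intercept` flag, the step size, and the data are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, fit_intercept=True, step=0.1, iters=50000):
    """Batch gradient descent on the mean log-loss for p = sigmoid(w*x + b).

    When fit_intercept is False, b is held at 0 and only w is learned.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= step * sum(e * x for e, x in zip(errors, xs)) / n
        if fit_intercept:
            b -= step * sum(errors) / n
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of points classified correctly at a 0.5 threshold."""
    preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Toy data: the classes split around x = 5, away from the origin.
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w_with, b_with = fit_logistic(xs, ys, fit_intercept=True)
w_without, b_without = fit_logistic(xs, ys, fit_intercept=False)

acc_with = accuracy(w_with, b_with, xs, ys)
acc_without = accuracy(w_without, b_without, xs, ys)
```

Because every x here is positive, a model without an intercept puts its decision boundary at x = 0 and assigns all points to the same class, whereas the model with an intercept can move the boundary to between 4 and 6.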

          People

            Assignee: Omede Firouz
            Reporter: Omede Firouz