  1. Spark
  2. SPARK-17133 Improvements to linear methods in Spark
  3. SPARK-17476

Proper handling for unseen labels in logistic regression training.


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML

    Description

      Now that logistic regression supports multiclass, it is possible to train on data that has K classes where one or more of the classes do not appear in the training set. For example,

      (0.0, x1)
      (2.0, x2)
      ...
      

      Currently, logistic regression assumes that the outcome in the above dataset has three levels: 0, 1, 2. Since label 1 never appears, it should never be predicted. In theory, the coefficients for class 1 should be zero and its intercept should be negative infinity. This can cause problems since we center the intercepts after training.
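
      To illustrate why centering is a concern, here is a minimal sketch in plain Python (not Spark code). It assumes that centering means subtracting the mean intercept from every class intercept; the helper name center_intercepts and the sample values are hypothetical.

```python
import math


def center_intercepts(intercepts):
    """Subtract the mean intercept from each entry.

    Assumption: this mirrors mean-centering of multinomial intercepts;
    it is an illustration, not the actual Spark implementation.
    """
    mean = sum(intercepts) / len(intercepts)
    return [b - mean for b in intercepts]


# A large but finite negative intercept for the unseen class centers cleanly.
finite = center_intercepts([0.5, -20.0, 1.0])

# An intercept of negative infinity makes the mean negative infinity, so the
# finite entries become +inf and the unseen class becomes NaN (-inf minus -inf).
degenerate = center_intercepts([0.5, float("-inf"), 1.0])

print(finite)
print(degenerate)
```

      Under this assumption, a single infinite intercept is enough to destroy the intercepts of every class, not just the unseen one.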

      We should discuss whether or not the intercepts actually tend to -infinity in practice, and whether or not we should even include them in training.
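
      As a starting point for that discussion, the following toy sketch fits an intercept-only multinomial model with plain gradient descent on labels that contain classes 0 and 2 but never class 1. It is a simplified stand-in, not the optimizer Spark uses; the function names, learning rate, and step count are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_intercepts(labels, num_classes, steps, lr=0.5):
    """Gradient descent on the mean negative log-likelihood of an
    intercept-only multinomial model. Returns the intercepts after each step.
    """
    n = len(labels)
    observed = [labels.count(k) / n for k in range(num_classes)]
    b = [0.0] * num_classes
    history = []
    for _ in range(steps):
        p = softmax(b)
        # Gradient for class k is (predicted probability - observed frequency).
        # For an unseen class the observed frequency is 0, so the gradient is
        # always positive and the intercept decreases at every step.
        b = [b[k] - lr * (p[k] - observed[k]) for k in range(num_classes)]
        history.append(list(b))
    return history


labels = [0, 2, 0, 2]  # class 1 never appears
history = fit_intercepts(labels, num_classes=3, steps=1000)
unseen = [b[1] for b in history]
print(unseen[0], unseen[len(unseen) // 2], unseen[-1])
```

      In this toy setup the intercept for the unseen class keeps decreasing, but each step is proportional to the predicted probability of that class, so the decrease slows as the probability shrinks. Whether a real optimizer with a convergence tolerance or regularization behaves the same way is exactly the open question above.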

          People

            Assignee: Unassigned
            Reporter: Seth Hendrickson (sethah)