SPARK-21806

BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: MLlib

    Description

      I would like to refer to a discussion in scikit-learn, as this behavior is probably based on the scikit-learn implementation.

      Summary:
      Currently, the y-axis intercept of the precision-recall curve is always set to (0.0, 1.0). This is not ideal in certain edge cases (see the example below) and can also affect cross-validation when the optimization metric is set to "areaUnderPR".

      Please consider blucena's post for possible alternatives.

      Edge case example:
      Consider a bad classifier that assigns a high probability to all samples. A possible output might look like this:

      Real label   Score
      1.0          1.0
      0.0          1.0
      0.0          1.0
      0.0          1.0
      0.0          1.0
      0.0          1.0
      0.0          1.0
      0.0          1.0
      0.0          1.0
      0.0          0.95
      0.0          0.95
      1.0          1.0

      This results in the following PR points (the first row is the point set by default):

      Threshold   Recall   Precision
      1.0         0.0      1.0
      0.95        1.0      0.2
      0.0         1.0      0.16

      The auPRC would be around 0.6, almost entirely because of the default first point. Classifiers with a more differentiated probability assignment can therefore falsely appear to perform worse on this metric.
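
The arithmetic behind the 0.6 figure can be reproduced with a small, self-contained sketch. It does not call Spark. It assumes that the curve is built by treating each distinct score as a threshold (score >= threshold counts as positive), that a fixed (0.0, 1.0) point is prepended, and that the area is integrated with the trapezoidal rule. The helper names `pr_points` and `trapezoid_area` are hypothetical, and the second starting point shown (reusing the precision of the first real point) is only one possible alternative for comparison, not necessarily what was adopted.

```python
# (label, score) pairs copied from the edge-case table above.
SAMPLES = [(1.0, 1.0)] + [(0.0, 1.0)] * 8 + [(0.0, 0.95)] * 2 + [(1.0, 1.0)]


def pr_points(samples):
    """Return (recall, precision) per distinct score, highest threshold first."""
    total_pos = sum(1 for label, _ in samples if label == 1.0)
    points = []
    for threshold in sorted({score for _, score in samples}, reverse=True):
        # Every sample scoring at or above the threshold is predicted positive.
        predicted = [label for label, score in samples if score >= threshold]
        true_pos = sum(1 for label in predicted if label == 1.0)
        points.append((true_pos / total_pos, true_pos / len(predicted)))
    return points


def trapezoid_area(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


curve = pr_points(SAMPLES)

# Fixed first point (0.0, 1.0), as described in this issue.
area_fixed = trapezoid_area([(0.0, 1.0)] + curve)

# Assumed alternative: start at the precision of the first real point.
area_alt = trapezoid_area([(0.0, curve[0][1])] + curve)

print(round(area_fixed, 3), round(area_alt, 3))
```

With the fixed first point the area comes out at about 0.6, while the assumed alternative gives about 0.2, which is much closer to what one would expect from a classifier that scores nearly everything as positive.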


          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Marc Kaminski (MKami)