SPARK-32472: Expose confusion matrix elements by threshold in BinaryClassificationMetrics


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

      Description

      Currently, the only thresholded metrics available from BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly through roc()) the false positive rate.

      Unfortunately, you can't always compute the individual thresholded confusion matrix elements (TP, FP, TN, FN) from these quantities. You can set up a system of equations from the existing thresholded metrics and the total count, but it becomes underdetermined when there are no true positives: precision and recall then carry no information about FP and FN, leaving only the false positive rate and the total count to constrain the three remaining unknowns.
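      For a concrete sense of the gap, below is a minimal sketch of the kind of workaround a caller has to write today, recomputing the counts per threshold from the raw (score, label) pairs. The method name and the brute-force per-threshold pass are purely illustrative; none of this reuses the counts that BinaryClassificationMetrics has already accumulated internally.

      import org.apache.spark.rdd.RDD

      // Illustrative workaround: recompute TP/FP/TN/FN at each threshold from the raw
      // (score, label) pairs, since BinaryClassificationMetrics does not expose them.
      def confusionByThreshold(
          scoreAndLabels: RDD[(Double, Double)],
          thresholds: Seq[Double]): Seq[(Double, (Long, Long, Long, Long))] = {
        thresholds.map { t =>
          // Predict positive when the score is at or above the threshold.
          val counts = scoreAndLabels.map { case (score, label) =>
            (score >= t, label == 1.0) match {
              case (true, true)   => (1L, 0L, 0L, 0L) // true positive
              case (true, false)  => (0L, 1L, 0L, 0L) // false positive
              case (false, false) => (0L, 0L, 1L, 0L) // true negative
              case (false, true)  => (0L, 0L, 0L, 1L) // false negative
            }
          }.reduce { case ((tp1, fp1, tn1, fn1), (tp2, fp2, tn2, fn2)) =>
            (tp1 + tp2, fp1 + fp2, tn1 + tn2, fn1 + fn2)
          }
          (t, counts)
        }
      }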

      Fortunately, the individual confusion matrix elements by threshold are already computed and sitting in the internal confusions variable. It would be helpful to expose these elements directly. The easiest way would probably be to add methods like:

      def truePositivesByThreshold(): RDD[(Double, Double)] =
        confusions.map { case (t, c) => (t, c.weightedTruePositives) }
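
      The other three elements would follow the same pattern, e.g. (names purely illustrative, assuming BinaryConfusionMatrix keeps its current weighted* accessors):

      def falsePositivesByThreshold(): RDD[(Double, Double)] =
        confusions.map { case (t, c) => (t, c.weightedFalsePositives) }

      def trueNegativesByThreshold(): RDD[(Double, Double)] =
        confusions.map { case (t, c) => (t, c.weightedTrueNegatives) }

      def falseNegativesByThreshold(): RDD[(Double, Double)] =
        confusions.map { case (t, c) => (t, c.weightedFalseNegatives) }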

      An alternative would be to expose the entire RDD[(Double, BinaryConfusionMatrix)] through a single method, but BinaryConfusionMatrix is also currently package private, so it would have to be made public as well.
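
      For context, here is a hypothetical caller-side sketch of how the per-element accessors might be used in spark-shell (truePositivesByThreshold() is the method proposed above and does not exist today; sc is the usual SparkContext):

      import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

      // (score, label) pairs; labels are 0.0 or 1.0
      val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.6, 0.0), (0.4, 1.0), (0.1, 0.0)))
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)

      // Hypothetical accessor from this proposal: (threshold, weighted true positive count)
      val tpByThreshold = metrics.truePositivesByThreshold().collect()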

      The closest existing issue I found is SPARK-18844 (https://issues.apache.org/jira/browse/SPARK-18844), which proposed adding new calculations to BinaryClassificationMetrics and was closed without any changes being merged.

            People

            • Assignee: Unassigned
            • Reporter: Kevin Moore (kmoore)