SPARK-27577: Wrong thresholds selected by BinaryClassificationMetrics when downsampling


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: MLlib
    • Labels: None

    Description

      In binary metrics, a threshold means that any instance with a score >= the threshold is considered positive.

      However, in the existing implementation:

      1. When `numBins` is set while creating a `BinaryClassificationMetrics` object, all records (ordered by score in descending order) are grouped into chunks.
      2. In each chunk, the statistics of the records are accumulated into a `BinaryLabelCounter`, while the first record's score (also the largest in the chunk) is selected as the threshold (see the sketch after this list).
      3. These sampled records then form a new, smaller data set from which the binary metrics are calculated.
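
      A minimal, self-contained sketch of this downsampling (illustrative only, not the actual MLlib source; `Counter` stands in for MLlib's `BinaryLabelCounter`, and the data is made up):

```scala
// Stand-in for MLlib's BinaryLabelCounter: label counts at one score.
case class Counter(var tp: Long = 0L, var fp: Long = 0L) {
  def +=(other: Counter): Unit = { tp += other.tp; fp += other.fp }
}

// (score, counts at that score), already ordered by score descending.
val counts: Seq[(Double, Counter)] = Seq(
  (0.9, Counter(tp = 1)), (0.8, Counter(fp = 1)), (0.7, Counter(tp = 1)),
  (0.6, Counter(fp = 1)), (0.5, Counter(tp = 1)), (0.4, Counter(fp = 1))
)
val numBins = 2
val grouping = counts.size / numBins  // chunk size: 3 records per chunk

val sampled = counts.grouped(grouping).map { chunk =>
  val threshold = chunk.head._1             // BUG: largest score in the chunk
  val agg = Counter()
  chunk.foreach { case (_, c) => agg += c } // merges the WHOLE chunk's counts
  (threshold, agg)
}.toSeq
// sampled == Seq((0.9, Counter(2, 1)), (0.6, Counter(1, 2)))
```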

      Step 2 introduces the bug: each selected threshold gets paired with statistics that are too favorable for it, such as a larger `true positive` count and a smaller `false negative` count, when calculating `recallByThreshold`, `precisionByThreshold`, etc., because the merged counts also cover records whose scores are strictly below the reported threshold.
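
      With the made-up data in the sketch above, the first chunk reports threshold 0.9 with 2 true positives and 1 false positive; yet at a real cutoff of 0.9 only the single record scored 0.9 qualifies as positive, so the correct counts at that threshold are 1 true positive and 0 false positives.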

      Thus, the fix is straightforward: pick the last record's score in each chunk as the threshold while the statistics are merged, so that every record counted in a chunk really has a score >= its reported threshold.
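
      Continuing the illustrative sketch above, the fix amounts to a one-line change:

```scala
val fixed = counts.grouped(grouping).map { chunk =>
  val threshold = chunk.last._1             // FIX: smallest score in the chunk
  val agg = Counter()
  chunk.foreach { case (_, c) => agg += c }
  (threshold, agg)
}.toSeq
// fixed == Seq((0.7, Counter(2, 1)), (0.4, Counter(1, 2)))
// Now "2 true positives, 1 false positive" at threshold 0.7 is exact:
// every record counted in the first chunk has a score >= 0.7.
```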


            People

              Assignee: Shaochen Shi (shishaochen)
              Reporter: Shaochen Shi (shishaochen)
