SPARK-27577: Wrong thresholds selected by BinaryClassificationMetrics when downsampling


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: MLlib
    • Labels: None

      Description

      In binary classification metrics, a threshold means that any instance with a score >= the threshold is considered positive.
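
      As a minimal illustration of this convention (plain Scala; the data and names here are hypothetical, not from MLlib):

```scala
// An instance is predicted positive iff its score >= threshold.
val scoreAndLabels = Seq((0.9, 1.0), (0.8, 0.0), (0.4, 1.0), (0.1, 0.0))
val threshold = 0.5
val predictedPositive = scoreAndLabels.filter { case (score, _) => score >= threshold }
// predictedPositive == Seq((0.9, 1.0), (0.8, 0.0))
```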

      However, in the existing implementation:

      1. When `numBins` is set on a `BinaryClassificationMetrics` object, all records (ordered by score, descending) are grouped into chunks.
      2. Within each chunk, the records' statistics are accumulated into a `BinaryLabelCounter`, and the first record's score (also the largest in the chunk) is selected as the chunk's threshold.
      3. These sampled records then form a new, smaller data set from which the binary metrics are calculated (see the sketch after this list).
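
      The sketch below reproduces this grouping with plain Scala collections instead of RDDs; `Counts` stands in for `BinaryLabelCounter`, and all data and names are hypothetical:

```scala
// Label counts at one score; a stand-in for BinaryLabelCounter.
case class Counts(numPositives: Long, numNegatives: Long) {
  def +(o: Counts): Counts =
    Counts(numPositives + o.numPositives, numNegatives + o.numNegatives)
}

// (score, label counts at that score), already sorted by score descending.
val sorted: Seq[(Double, Counts)] = Seq(
  (0.9, Counts(1, 0)), (0.8, Counts(1, 0)), (0.7, Counts(0, 1)),
  (0.6, Counts(1, 0)), (0.5, Counts(0, 1)), (0.4, Counts(0, 1))
)

val numBins = 3
val chunkSize = sorted.size / numBins

// Existing behavior: each chunk's threshold is the FIRST (largest) score,
// while the counts of the whole chunk are merged together.
val downsampledBuggy: Seq[(Double, Counts)] =
  sorted.grouped(chunkSize).map { chunk =>
    (chunk.head._1, chunk.map(_._2).reduce(_ + _))
  }.toSeq
// downsampledBuggy == Seq((0.9, Counts(2, 0)), (0.7, Counts(1, 1)), (0.5, Counts(0, 2)))
```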

      The second step introduces the bug: a reported threshold becomes paired with statistics for records whose scores fall below it, which yields wrong values, e.g. a larger true-positive count and a smaller false-negative count, when calculating `recallByThreshold`, `precisionByThreshold`, etc.
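
      Continuing the hypothetical sketch above, the distortion shows up directly in the cumulative counts that drive the per-threshold metrics:

```scala
// Running totals of counts at each reported threshold.
val cumulativeBuggy = downsampledBuggy
  .scanLeft((Double.NaN, Counts(0, 0))) { case ((_, acc), (t, c)) => (t, acc + c) }
  .tail
// cumulativeBuggy == Seq((0.9, Counts(2, 0)), (0.7, Counts(3, 1)), (0.5, Counts(3, 3)))
// At the reported threshold 0.9, only the record scored 0.9 truly has
// score >= 0.9, so TP should be 1 and FN 2; the merged chunk instead
// reports TP = 2 and FN = 1.
```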

      Thus, the fix is straightforward: pick the last record's score in each chunk as the threshold while the statistics are merged, so that every record counted under a threshold genuinely has a score >= that threshold, as sketched below.
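
      Under the same hypothetical sketch, the fix amounts to taking `chunk.last` instead of `chunk.head`:

```scala
// Fixed behavior: each chunk's threshold is the LAST (smallest) score, so
// every record merged into the chunk genuinely has score >= the threshold.
val downsampledFixed: Seq[(Double, Counts)] =
  sorted.grouped(chunkSize).map { chunk =>
    (chunk.last._1, chunk.map(_._2).reduce(_ + _))
  }.toSeq
// downsampledFixed == Seq((0.8, Counts(2, 0)), (0.6, Counts(1, 1)), (0.4, Counts(0, 2)))
// Now the count at threshold 0.8 (TP = 2) is consistent with the
// definition: both records scored 0.9 and 0.8 have score >= 0.8.
```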

              People

              • Assignee: Shaochen Shi (shishaochen)
              • Reporter: Shaochen Shi (shishaochen)
              • Votes: 0
              • Watchers: 1
