Spark / SPARK-30210

Give more informative error for BinaryClassificationEvaluator when data with only one label is provided

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: PySpark on Databricks

    Description

      Hi all,

      While doing some machine learning work with PySpark, I ran into a confusing error message:

      from pyspark.ml.evaluation import BinaryClassificationEvaluator

      # Model and train/test set generated...
      evaluator = BinaryClassificationEvaluator(labelCol=label, metricName='areaUnderROC')
      prediction = model.transform(test_data)
      auc = evaluator.evaluate(prediction)

      org.apache.spark.SparkException: Job aborted due to stage failure: Task 37 in stage 21.0 failed 4 times, most recent failure: Lost task 37.3 in stage 21.0 (TID 2811, 10.139.65.48, executor 16): java.lang.ArrayIndexOutOfBoundsException

      After some investigation, I found that the data I was trying to predict on contained only one label value, rather than both positive and negative labels. That is easy enough to fix, but could we replace this error with one that explicitly points out the problem? Would it be acceptable to check the labels ahead of time and ensure that both classes are represented? Alternatively, could we change the docs for BinaryClassificationEvaluator to explain what this error means?
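
The kind of up-front check asked about above could look roughly like the following on the user side. This is only a sketch, not anything Spark itself provides: the helper names `require_both_labels` and `distinct_labels` are hypothetical, and the approach assumes that running one extra Spark job to collect the distinct label values is acceptable.

```python
def require_both_labels(distinct_values, label_col="label"):
    """Raise a descriptive error unless at least two distinct label values exist.

    Hypothetical helper for illustration; not part of PySpark.
    """
    found = sorted(set(distinct_values))
    if len(found) < 2:
        raise ValueError(
            f"Column '{label_col}' contains only the label value(s) {found}; "
            "areaUnderROC needs both positive and negative examples."
        )
    return found


def distinct_labels(df, label_col):
    """Collect the distinct label values of a PySpark DataFrame.

    Hypothetical helper; it triggers one extra Spark job.
    """
    return [row[0] for row in df.select(label_col).distinct().collect()]


# Intended use with the snippet from the description (needs a live Spark session):
#   require_both_labels(distinct_labels(prediction, label), label)
#   auc = evaluator.evaluate(prediction)
```

Running the check before `evaluator.evaluate(prediction)` would turn the single-label case into a `ValueError` that names the column and the values found, so the cause no longer has to be inferred from a stage failure.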

          People

            Assignee: Unassigned
            Reporter: Paul Anzel (anzelpwj)