Spark / SPARK-8971

Support balanced class labels when splitting train/cross validation sets

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels:

      Description

      CrossValidator and the proposed TrainValidatorSplit (SPARK-8484) are Spark ML classes that partition data into training and evaluation sets in order to select hyperparameters via cross validation.

      Both classes currently split the data by random sampling. However, when class frequencies are highly imbalanced (e.g. when detecting extremely low-frequency events), random sampling can produce splits that are not representative of actual out-of-sample performance (e.g. a split may contain no positive examples at all).
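To give a rough sense of how likely this is, here is a small Python sketch. It assumes each example is assigned to one of the folds independently and uniformly at random, which is a simplification and not a description of Spark's exact sampling code. The function name is hypothetical and chosen only for illustration.

```python
def prob_fold_has_no_positives(num_positives: int, num_folds: int) -> float:
    """Chance that one given fold receives none of the positive examples.

    Assumption: every example lands in each fold with probability
    1 / num_folds, independently of all other examples.
    """
    return (1.0 - 1.0 / num_folds) ** num_positives


if __name__ == "__main__":
    # With 3 folds, show how the risk shrinks as positives become less rare.
    for positives in (1, 5, 20):
        print(positives, prob_fold_has_no_positives(positives, 3))
```

Under this assumption the risk falls off geometrically with the number of positive examples, so the problem mainly affects datasets where the rare class has only a handful of members.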

      Mainstream R packages like caret already support splitting the data based on the class labels.
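As an illustration of what label-based (stratified) splitting could look like, here is a minimal plain-Python sketch. It is not Spark's API and not caret's implementation: the function `stratified_k_fold` and its parameters are hypothetical, and it works on an in-memory list of labels rather than a distributed dataset.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, num_folds, seed=0):
    """Return a fold id for every example so that each class is spread
    as evenly as possible across the folds (hypothetical helper)."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    offset = 0  # rotates the starting fold so overall fold sizes stay even
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize which examples go to which fold
        for pos, idx in enumerate(indices):
            fold_of[idx] = (offset + pos) % num_folds
        offset += len(indices)
    return fold_of


if __name__ == "__main__":
    # 3 positives among 30 examples, split into 3 folds.
    labels = [1] * 3 + [0] * 27
    folds = stratified_k_fold(labels, num_folds=3)
    for f in range(3):
        members = [i for i, fold in enumerate(folds) if fold == f]
        positives = sum(labels[i] for i in members)
        print(f, len(members), positives)
```

Dealing each class out round-robin means per-class counts differ by at most one between folds, so a fold is left without positives only when there are fewer positives than folds.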

              People

              • Assignee: Seth Hendrickson (sethah)
              • Reporter: Feynman Liang (fliang)
              • Shepherd: Nicholas Pentreath
              • Votes: 7
              • Watchers: 13
