Spark / SPARK-14862

Tree and ensemble classification: do not require label metadata

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: ML
    • Labels: None

    Description

      spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier require that the labelCol have metadata specifying the number of classes. Instead, if the number of classes is not specified, we should automatically scan the column to identify numClasses.
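      As a rough illustration of the proposed fallback, the scan can be pictured as "largest label index plus one". The plain-Python sketch below is not Spark code: the function name `infer_num_classes` and its validation rules are assumptions made for illustration only.

```python
from typing import Iterable


def infer_num_classes(labels: Iterable[float]) -> int:
    """Infer the class count from already-indexed labels (0, 1, ..., k-1).

    Hypothetical helper: assumes labels are non-negative, integer-valued numbers.
    """
    max_label = -1
    for value in labels:
        # Reject anything that cannot be a class index (negative, fractional, NaN).
        if value < 0 or not float(value).is_integer():
            raise ValueError(f"labels must be non-negative integers, got {value!r}")
        max_label = max(max_label, int(value))
    if max_label < 0:
        raise ValueError("cannot infer numClasses from an empty label column")
    # Indexed labels run from 0 to k-1, so k is the largest index plus one.
    return max_label + 1


print(infer_num_classes([0.0, 2.0, 1.0, 1.0]))  # prints 3
```

      Labels that are not valid indices are rejected rather than coerced, which matches the requirement that labels already be indexed.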

      Unlike SPARK-7126, this issue still requires labels to be indexed; it only drops the requirement for metadata. It does not cover support for String labels.

      Note: This could cause problems when cross validation is run on very small datasets. If there are k classes but class index k-1 does not appear in a training split, the scan will infer too few classes. We should make sure the error thrown helps the user understand the solution, which is probably to use StringIndexer to index the whole dataset's labelCol before doing cross validation.
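      The failure mode in the note can be reproduced without Spark. The sketch below is plain Python under stated assumptions: the helper names, the toy label list, and the round-robin fold assignment are all hypothetical and only illustrate why a training split can infer fewer classes than the full dataset contains.

```python
def infer_num_classes(labels):
    """Class count implied by indexed labels 0..k-1: largest index plus one."""
    return int(max(labels)) + 1


def check_folds(labels, num_folds):
    """For each held-out fold, report the class count inferred from the rest.

    Returns (held_out_fold, inferred_count, matches_full_dataset) tuples.
    Rows are assigned to folds round-robin by position (an assumption).
    """
    full = infer_num_classes(labels)
    report = []
    for held_out in range(num_folds):
        train = [y for i, y in enumerate(labels) if i % num_folds != held_out]
        inferred = infer_num_classes(train)
        report.append((held_out, inferred, inferred == full))
    return report


# k = 3 classes, but class index 2 appears only once, in the last row.
labels = [0, 1, 0, 1, 0, 2]
for held_out, inferred, ok in check_folds(labels, num_folds=3):
    status = "ok" if ok else "MISMATCH: highest class missing from training split"
    print(f"held-out fold {held_out}: inferred numClasses = {inferred} ({status})")
```

      When the fold holding the only class-2 row is held out, the training split implies 2 classes instead of 3. Indexing the whole labelCol up front, as the note suggests, fixes the class count before the data is split.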

            People

              Assignee: Joseph K. Bradley (josephkb)
              Reporter: Joseph K. Bradley (josephkb)