Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20081

RandomForestClassifier doesn't seem to support more than 100 labels

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • 2.1.0
    • None
    • ML, MLlib
    • None
    • Java

    Description

      When feeding data with more than 100 labels into RanfomForestClassifer#fit() (from java code), I get the following error message:

      Classifier inferred 143 from label values in column rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) allowed to be inferred from values.  
        To avoid this error for labels with > 100 classes, specify numClasses explicitly in the metadata; this can be done by applying StringIndexer to the label column.
      

      Setting "numClasses" in the metadata for the label column doesn't make a difference. Looking at the code, this is not surprising, since MetadataUtils.getNumClasses() ignores this setting:

        def getNumClasses(labelSchema: StructField): Option[Int] = {
          Attribute.fromStructField(labelSchema) match {
            case binAttr: BinaryAttribute => Some(2)
            case nomAttr: NominalAttribute => nomAttr.getNumValues
            case _: NumericAttribute | UnresolvedAttribute => None
          }
        }
      

      The alternative would be to pass a proper "maxNumClasses" parameter to the classifier, so that Classifier#getNumClasses() allows a larger number of auto-detected labels. However, RandomForestClassifer#train() calls #getNumClasses without the "maxNumClasses" parameter, causing it to use the default of 100:

        override protected def train(dataset: Dataset[_]): RandomForestClassificationModel = {
          val categoricalFeatures: Map[Int, Int] =
            MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
          val numClasses: Int = getNumClasses(dataset)
      // ...
      

      My scala skills are pretty sketchy, so please correct me if I misinterpreted something. But as it seems right now, there is no way to learn from data with more than 100 labels via RandomForestClassifier.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              creinig Christian Reiniger
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: