Description
When feeding data with more than 100 labels into RandomForestClassifier#fit() (from Java code), I get the following error message:
Classifier inferred 143 from label values in column rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) allowed to be inferred from values. To avoid this error for labels with > 100 classes, specify numClasses explicitly in the metadata; this can be done by applying StringIndexer to the label column.
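The workaround that the error message itself suggests can be sketched roughly as follows. This is a minimal illustration, not from the original report; the column names, data, and session setup are assumptions:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object StringIndexerLabelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("label-indexing").getOrCreate()
    import spark.implicits._

    // Hypothetical raw data: the label column has more than 100 distinct values.
    val raw = (0 until 143).map(i => (i.toDouble, s"label_$i")).toDF("feature", "rawLabel")

    // StringIndexer attaches NominalAttribute metadata (including the number of
    // distinct values) to its output column, which getNumClasses() can read,
    // so no class count has to be inferred from the data.
    val indexer = new StringIndexer()
      .setInputCol("rawLabel")
      .setOutputCol("indexedLabel")
    val indexed = indexer.fit(raw).transform(raw)

    // The classifier would then be pointed at the indexed column, e.g.
    //   new RandomForestClassifier().setLabelCol("indexedLabel")
    spark.stop()
  }
}
```

Note that this only applies when the labels can be routed through StringIndexer in the first place; the report below is about the case where metadata is set directly.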
Setting "numClasses" in the metadata for the label column doesn't make a difference. Looking at the code, this is not surprising, since MetadataUtils.getNumClasses() ignores this setting:
def getNumClasses(labelSchema: StructField): Option[Int] = {
  Attribute.fromStructField(labelSchema) match {
    case binAttr: BinaryAttribute => Some(2)
    case nomAttr: NominalAttribute => nomAttr.getNumValues
    case _: NumericAttribute | UnresolvedAttribute => None
  }
}
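Since getNumClasses() dispatches on the column's ML Attribute (via Attribute.fromStructField) rather than reading a raw "numClasses" metadata key, attaching a NominalAttribute with the class count appears to be the metadata form it actually expects. A sketch; the helper object/method names and the column name are assumptions for illustration, not Spark API:

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object LabelMetadataSketch {
  // Attach NominalAttribute metadata to the label column so that
  // MetadataUtils.getNumClasses() can resolve the class count without
  // inferring it from the data (and thus without hitting the 100-class cap).
  def withNumClasses(df: DataFrame, labelCol: String, numClasses: Int): DataFrame = {
    val meta = NominalAttribute.defaultAttr
      .withName(labelCol)
      .withNumValues(numClasses)
      .toMetadata()
    // Column.as(alias, metadata) re-aliases the column with the new metadata.
    df.withColumn(labelCol, col(labelCol).as(labelCol, meta))
  }
}
```

Whether this sidesteps the reported limit would still need to be verified against the Spark version in question.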
The alternative would be to pass a proper "maxNumClasses" parameter to the classifier, so that Classifier#getNumClasses() allows a larger number of auto-detected labels. However, RandomForestClassifier#train() calls #getNumClasses without the "maxNumClasses" parameter, causing it to use the default of 100:
override protected def train(dataset: Dataset[_]): RandomForestClassificationModel = {
  val categoricalFeatures: Map[Int, Int] =
    MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
  val numClasses: Int = getNumClasses(dataset)
  // ...
My Scala skills are pretty sketchy, so please correct me if I have misinterpreted something. But as it stands, there seems to be no way to learn from data with more than 100 labels via RandomForestClassifier.
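For completeness, the symptom can be reproduced with a sketch like the one below. This is an illustration I am adding, not code from the report; the data shape, column names, and the exact exception type are assumptions:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object NumClassesRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
    import spark.implicits._

    // A plain numeric label column with 143 distinct values and no ML
    // Attribute metadata, so the class count must be inferred from the data.
    val data = (0 until 143)
      .map(i => (i.toDouble, Vectors.dense(i.toDouble)))
      .toDF("label", "features")

    val rfc = new RandomForestClassifier()
    try {
      // Expected to fail with the "exceeded the max numClasses (100)" error
      // quoted above, since 143 > the default maxNumClasses of 100.
      rfc.fit(data)
    } catch {
      case e: Exception => println(e.getMessage)
    }
    spark.stop()
  }
}
```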
Issue Links
- Is contained by: SPARK-14046 RandomForest improvement umbrella (Resolved)