Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Not A Problem
- Affects Version/s: 1.4.0
- Fix Version/s: None
- Component/s: None
Description
In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala there's a statement that caps maxPossibleBins at numExamples when numExamples is less than strategy.maxBins.
This can cause an error when training on small partitions; the error is triggered further down in the logic, where it's required that maxCategoriesPerFeature be less than or equal to maxPossibleBins.
Here's an example of how it manifested: the partition contained 49 rows (i.e., numExamples = 49), but strategy.maxBins was 57.
The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore reduced maxPossibleBins to 49, causing the require(maxCategoriesPerFeature <= maxPossibleBins) check to throw an error.
In short, this will be a problem when training on small datasets with a categorical feature that has more categories than numExamples.
In our local testing we commented out the "math.min(strategy.maxBins, numExamples)" line and the decision tree succeeded where it had failed previously.
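The failure mode described above can be sketched in a few lines. This is a simplified, hypothetical reconstruction of the capping logic and the require check (object and method names are illustrative, not the actual DecisionTreeMetadata code), using the values from the report:

```scala
// Hedged sketch of the bin-capping behavior described in this report.
// Names other than maxBins/numExamples/maxCategoriesPerFeature are illustrative.
object BinCapSketch {
  // Mirrors: maxPossibleBins = math.min(strategy.maxBins, numExamples)
  def maxPossibleBins(maxBins: Int, numExamples: Long): Int =
    math.min(maxBins.toLong, numExamples).toInt

  // Mirrors the downstream check that fails for small partitions.
  def checkCategorical(maxCategoriesPerFeature: Int,
                       maxBins: Int,
                       numExamples: Long): Unit = {
    val capped = maxPossibleBins(maxBins, numExamples)
    require(maxCategoriesPerFeature <= capped,
      s"maxCategoriesPerFeature ($maxCategoriesPerFeature) must be <= maxPossibleBins ($capped)")
  }

  def main(args: Array[String]): Unit = {
    // Values from the report: 49 rows, strategy.maxBins = 57.
    println(maxPossibleBins(57, 49L)) // capped to 49 by numExamples
    // A feature with 50 categories then trips the require:
    // checkCategorical(50, 57, 49L)  // throws IllegalArgumentException
  }
}
```

With a categorical feature of, say, 50 categories, `checkCategorical(50, 57, 49L)` throws even though `maxBins = 57` would otherwise have been sufficient, which is exactly the small-partition failure reported here.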
Attachments
Issue Links
- is related to
-
SPARK-9077 Improve error message for decision trees when numExamples < maxCategoriesPerFeature
- Resolved