Details
Improvement
Status: Resolved
Minor
Resolution: Incomplete
Description
Currently, the RandomForest algorithm takes a single maxBins value that determines the number of candidate splits for every feature. Training time can become very high when a single categorical column has a large number of unique values: maxBins must be at least as large as that column's number of categories, and the same high value then applies to all numeric (continuous) columns, even though they do not need that many splits.
Encoding the categorical column into separate features makes the data very wide, which requires increasing maxMemoryInMB and puts more pressure on the GC.
Keeping separate maxBins values for categorical and continuous features would be useful in this regard.
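
To make the cost concrete, here is a rough sketch in plain Python. It is not Spark code. The function `bins_per_feature` and its parameter `max_bins_continuous` are hypothetical names used only for illustration. The sketch assumes every continuous feature gets exactly as many bins as its limit allows; in practice the count can be lower when a feature has few distinct values.

```python
def bins_per_feature(arities, max_bins, max_bins_continuous=None):
    """Return the number of bins each feature would get.

    arities: one entry per feature; None marks a continuous feature,
             an int gives the number of categories of a categorical one.
    max_bins: the single limit that exists today.
    max_bins_continuous: hypothetical separate limit for continuous
             features (the change proposed in this issue).
    """
    largest_arity = max((a for a in arities if a is not None), default=0)
    # The single limit has to cover the widest categorical feature.
    if max_bins < largest_arity:
        raise ValueError("max_bins must be at least the largest categorical arity")
    # Without a separate limit, continuous features inherit max_bins.
    continuous_limit = max_bins if max_bins_continuous is None else max_bins_continuous
    return [continuous_limit if a is None else a for a in arities]


# One categorical column with 1000 unique values, plus 20 continuous columns.
arities = [1000] + [None] * 20

single = bins_per_feature(arities, max_bins=1000)
separate = bins_per_feature(arities, max_bins=1000, max_bins_continuous=32)

print(sum(single))    # 21000 bins in total with one shared limit
print(sum(separate))  # 1640 bins in total with a separate continuous limit
```

In this made-up example the shared limit inflates the total bin count by more than a factor of ten, all of it on continuous columns that gain nothing from the extra splits.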