[SPARK-14606] Different maxBins value for categorical and continuous features in RandomForest implementation. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: ML, MLlib
Labels:
- bulk-closed

Description

Currently, the RandomForest algorithm takes a single maxBins value that determines the number of candidate splits for every feature. Training time can grow sharply when even one categorical column has a large number of unique values, because maxBins must be at least as large as the number of categories in that column. The raised value then applies to all the numeric (continuous) columns as well, even though they do not need that many splits.

Encoding the categorical column into features makes the data very wide, which requires increasing maxMemoryInMB and also puts more pressure on the GC.

Keeping separate maxBins values for categorical and continuous features should be useful in this regard.
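To make the cost concrete, here is a rough sketch in plain Python (not Spark code) that counts bins per feature under the current single-value scheme and under the proposed split. It is a simplified model: it assumes each continuous feature uses the full bin budget and each categorical feature uses one bin per category. The function names and the parameters `max_bins_categorical` and `max_bins_continuous` are hypothetical and do not exist in Spark.

```python
from typing import Dict, List


def bins_single(arities: Dict[int, int], num_features: int, max_bins: int) -> List[int]:
    """Bins per feature when one max_bins value governs every feature.

    arities maps a categorical feature index to its number of categories;
    every other index is treated as a continuous feature.
    """
    largest = max(arities.values(), default=0)
    if max_bins < largest:
        # Mirrors the documented constraint: maxBins must cover the
        # largest categorical feature.
        raise ValueError(f"max_bins={max_bins} is below the largest arity {largest}")
    # Simplification: continuous features consume the whole budget.
    return [arities.get(i, max_bins) for i in range(num_features)]


def bins_separate(
    arities: Dict[int, int],
    num_features: int,
    max_bins_categorical: int,  # hypothetical parameter
    max_bins_continuous: int,   # hypothetical parameter
) -> List[int]:
    """Bins per feature if the two kinds of feature had separate limits."""
    largest = max(arities.values(), default=0)
    if max_bins_categorical < largest:
        raise ValueError(
            f"max_bins_categorical={max_bins_categorical} is below the largest arity {largest}"
        )
    return [arities.get(i, max_bins_continuous) for i in range(num_features)]


# One categorical column with 1000 unique values plus nine continuous columns.
arities = {0: 1000}
single = bins_single(arities, num_features=10, max_bins=1000)
separate = bins_separate(
    arities, num_features=10, max_bins_categorical=1000, max_bins_continuous=32
)
print(sum(single), sum(separate))
```

In this toy model the single-value scheme evaluates 1000 bins for each of the nine continuous columns, while the split scheme keeps them at 32 (Spark's default maxBins), so the total bin count is driven by the categorical column alone.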

People

Assignee: Unassigned
Reporter: Rahul Tanwani
Votes: 0
Watchers: 3

Dates

Created: 13/Apr/16 19:17
Updated: 21/May/19 04:35
Resolved: 21/May/19 04:35