[SPARK-5688] Splits for Categorical Variables in DecisionTrees - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: MLlib
Labels:
- categorical
- decisiontree
Environment:

Any

Description

The subsets of categories used to build splits on a categorical variable are not chosen at random. Each subset is derived from the binary representation of an integer from 1 to (2^(number of categories)) - 2, a range that excludes the empty and the full subset. In the current implementation, the integers used for the subsets are 1..numSplits. They should be drawn at random instead, because always using the lowest integers biases the splits towards the categories with the lower indexes.
Another problem is that if numBins/2 is larger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the min of numBins/2 and (2^(number of categories)) - 2; otherwise the same subsets might be considered more than once when choosing the splits.
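
The two points above can be illustrated with a minimal sketch. This is not the MLlib code; the function names (categories_from_mask, capped_num_splits, sequential_masks, random_masks) are hypothetical, and numBins/2 is assumed to be integer division:

```python
import random


def categories_from_mask(mask, num_categories):
    """Return the category indexes whose bit is set in `mask`."""
    return [c for c in range(num_categories) if (mask >> c) & 1]


def capped_num_splits(num_bins, num_categories):
    """Cap numBins/2 at the number of non-trivial subsets, 2^k - 2."""
    return min(num_bins // 2, 2 ** num_categories - 2)


def sequential_masks(num_bins, num_categories):
    """Masks 1..numSplits, as in the report (with the proposed cap applied)."""
    n = capped_num_splits(num_bins, num_categories)
    return list(range(1, n + 1))


def random_masks(num_bins, num_categories, rng):
    """Proposed alternative: distinct masks sampled from 1..2^k - 2."""
    n = capped_num_splits(num_bins, num_categories)
    # range(1, 2^k - 1) covers 1..2^k - 2, so the empty and full subsets
    # are never drawn; sampling without replacement avoids duplicates.
    return rng.sample(range(1, 2 ** num_categories - 1), n)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    for mask in random_masks(8, 10, rng):
        print(mask, categories_from_mask(mask, 10))
```

With 10 categories and numBins = 8, the sequential masks 1..4 only ever involve categories 0, 1 and 2, which is the bias described above; sampling the masks spreads the subsets over all categories.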

Issue Links

links to

[Github] Pull Request #4475 (edenovit)

People

Assignee: Unassigned

Reporter: Eric Denovitzer

Votes: 0

Watchers: 3

Dates

Created: 09/Feb/15 16:00

Updated: 17/Feb/15 15:09

Resolved: 17/Feb/15 15:09