Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Not A Problem
-
1.2.0
-
None
-
Any
Description
The categories in each subset used to build a split on a categorical variable are not chosen at random. Each subset is derived from the binary representation of an integer from 1 to (2^(number of categories)) - 2, a range that excludes the empty and the full subset. In the current implementation, the integers used for the subsets are 1..numSplits. They should be drawn at random instead, because the current choice biases the splits towards the categories with the lower indexes.
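To illustrate the bias, here is a minimal Python sketch of the integer-to-subset mapping described above. It is not the actual implementation: the function name `subset_from_index`, the bit order (bit i corresponds to category i), and the example values are assumptions made for illustration.

```python
def subset_from_index(index, num_categories):
    """Return the categories whose bit is set in `index`.

    Assumption: bit i of `index` corresponds to category i.
    """
    return [c for c in range(num_categories) if (index >> c) & 1]


num_categories = 5  # hypothetical feature with 5 categories
num_splits = 4      # hypothetical numSplits

# Behaviour described in this issue: use the integers 1..numSplits.
sequential = [subset_from_index(i, num_categories)
              for i in range(1, num_splits + 1)]
print(sequential)  # [[0], [1], [0, 1], [2]]
```

With these values, categories 3 and 4 never appear in any subset, which is the bias towards the lower indexes.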
Another problem is that if numBins/2 is larger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the minimum of numBins/2 and (2^(number of categories)) - 2; otherwise the same subsets might be considered more than once when choosing the splits.
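Both proposed changes can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: `choose_split_indices` and its parameters are hypothetical names, numBins/2 is taken as integer division, and sampling without replacement is one possible way to make the choice random.

```python
import random


def choose_split_indices(num_bins, num_categories, seed=None):
    """Pick subset indices at random, without repeats (hypothetical helper)."""
    # Number of non-empty, non-full subsets: 2^(number of categories) - 2.
    max_subsets = 2 ** num_categories - 2
    # Cap numSplits so that no subset has to be considered twice.
    num_splits = min(num_bins // 2, max_subsets)
    rng = random.Random(seed)
    # Sample distinct integers from 1..max_subsets instead of using 1..numSplits.
    return sorted(rng.sample(range(1, max_subsets + 1), num_splits))


# 3 categories allow only 6 subsets, so numBins/2 = 16 is capped to 6.
print(choose_split_indices(32, 3, seed=0))  # [1, 2, 3, 4, 5, 6]
```

When the cap applies, every valid subset is used exactly once; when it does not, the sampled integers are spread over the whole range rather than concentrated at the low end.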