Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.0
Fix Version/s: None
Spark 2.3.0
ML
linux
Description
After reading about Gini- and entropy-based impurity (see https://spark.apache.org/docs/2.2.0/mllib-decision-tree.html), it seems that impurity values should always lie between 0 and 1. However, some leaf nodes (usually, but not always, those with the minimum number of records) have negative impurity values (usually -1, but not always). This looks like a bug in the impurity calculation, but I am not sure. It happens for both Gini and entropy impurity, at slightly different nodes.
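For reference, here is a minimal sketch of the textbook definitions of the two impurity measures, computed from per-class record counts. It is not Spark's internal implementation; the function names `gini` and `entropy` and the base-2 logarithm are assumptions made for illustration. With these definitions neither measure can be negative. For two classes both stay within 0 and 1, while entropy can exceed 1 when there are more than two classes.

```scala
// Gini impurity: 1 - sum_i p_i^2, where p_i is the fraction of records in class i.
// (Illustrative helper, not part of the Spark API.)
def gini(counts: Seq[Double]): Double = {
  val total = counts.sum
  if (total == 0.0) 0.0
  else 1.0 - counts.map(c => (c / total) * (c / total)).sum
}

// Entropy: -sum_i p_i * log2(p_i), skipping empty classes so log(0) is never taken.
// (Illustrative helper, not part of the Spark API; base 2 is an assumption.)
def entropy(counts: Seq[Double]): Double = {
  val total = counts.sum
  if (total == 0.0) 0.0
  else counts.filter(_ > 0.0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2.0)
  }.sum
}
```

For example, a node holding 5 records of each of two classes gives `gini(Seq(5.0, 5.0))` equal to 0.5, and a pure node such as `Seq(10.0, 0.0)` gives 0 for both measures, so a reported value of -1 cannot come from these formulas applied to valid counts.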
I can reproduce this with almost any dataset using pretty standard parameters like the following:
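import org.apache.spark.ml.classification.DecisionTreeClassifier

// targetName holds the name of the label column in the dataset
new DecisionTreeClassifier()
  .setLabelCol(targetName)
  .setMaxBins(100)
  .setMaxDepth(5)
  .setMinInfoGain(0.01)
  .setMinInstancesPerNode(5)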
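One way to list the affected nodes is to walk the fitted tree and collect every node whose impurity is negative. The sketch below assumes a fitted `DecisionTreeClassificationModel` named `model` (a hypothetical variable, not shown in this report) and uses a helper, `negativeImpurityNodes`, that is introduced here only for illustration.

```scala
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Collect (depth, isLeaf, impurity) for every node with a negative impurity.
// Illustrative helper, not part of the Spark API.
def negativeImpurityNodes(node: Node, depth: Int = 0): Seq[(Int, Boolean, Double)] = {
  val here =
    if (node.impurity < 0.0) Seq((depth, node.isInstanceOf[LeafNode], node.impurity))
    else Seq.empty
  val below = node match {
    case n: InternalNode =>
      negativeImpurityNodes(n.leftChild, depth + 1) ++
        negativeImpurityNodes(n.rightChild, depth + 1)
    case _ => Seq.empty // leaf nodes have no children
  }
  here ++ below
}

// Hypothetical usage, assuming `model` is a fitted DecisionTreeClassificationModel:
// negativeImpurityNodes(model.rootNode).foreach(println)
```

Printing the depth and leaf flag alongside the value makes it easier to check whether the negative values are confined to leaves with the minimum number of records.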