Description
We should use weighted split points rather than the raw continuous binned feature values. For instance, in a dataset containing binary features (fed in as continuous ones), our splits are currently selected as x <= 0.0 and x > 0.0. For any real data with some smoothness properties, this is asymptotically bad compared to GBM's approach. The split point should instead be a weighted combination of the two adjacent ("innermost") bin values; e.g., if there are 30 samples with x = 0 and 10 with x = 1, the split above should be at 0.75.
Example:
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333333333333333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333333333333333
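
Below is a minimal Scala sketch of one weighting that is consistent with the 0.75 example above (the ticket gives the example but not an explicit formula, so this is an assumption, not Spark's actual implementation): place the split between two adjacent bin values at the point that divides the interval in proportion to the bin counts, so 30 samples at x = 0 and 10 at x = 1 give 0 + (30 / 40) * (1 - 0) = 0.75. The object and method names are hypothetical.

object WeightedSplits {
  // Hypothetical helper: weighted split point between two adjacent bin
  // values. The split lands at the count-weighted position along the
  // interval, matching the 0.75 example (30 samples at 0.0, 10 at 1.0).
  def weightedSplit(lowValue: Double, lowCount: Long,
                    highValue: Double, highCount: Long): Double = {
    require(lowCount + highCount > 0, "adjacent bins must contain samples")
    val leftFraction = lowCount.toDouble / (lowCount + highCount)
    lowValue + leftFraction * (highValue - lowValue)
  }

  def main(args: Array[String]): Unit = {
    // 30 samples with x = 0.0 and 10 with x = 1.0 => split at 0.75,
    // rather than the current threshold of x <= 0.0.
    println(weightedSplit(0.0, 30L, 1.0, 10L))  // prints 0.75
  }
}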
Issue Links
- Is contained by SPARK-14045 DecisionTree improvement umbrella (Resolved)