Description
It would be nice to be able to use RF and GBT for feature transformation:
First, fit an ensemble of trees (RF, GBT, or any other TreeEnsembleModel) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed, arbitrary feature index in a new feature space, and these leaf indices are encoded in a one-hot fashion (see the sketch below).
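To make the encoding concrete, here is a minimal sketch in plain Scala (no Spark API involved; the helper object and its signature are hypothetical) of how the leaf hit in each tree maps to a position in the one-hot feature space:

// Hypothetical helper: given the leaf index hit in each tree and the number of
// leaves per tree, compute the active positions in a one-hot feature space
// whose width is the total leaf count of the ensemble.
object LeafEncoding {
  def oneHotIndices(leafIndices: Array[Int], leavesPerTree: Array[Int]): Array[Int] = {
    val offsets = leavesPerTree.scanLeft(0)(_ + _)   // start offset of each tree's leaf block
    leafIndices.zipWithIndex.map { case (leaf, tree) => offsets(tree) + leaf }
  }
}

// Example: a 3-tree ensemble with 4, 8 and 4 leaves; a sample falls in leaves 2, 5, 0.
val active = LeafEncoding.oneHotIndices(Array(2, 5, 0), Array(4, 8, 4))
// active == Array(2, 9, 12), i.e. three 1s in a 16-dimensional one-hot vector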
This method was first introduced by Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf) and is implemented in several popular libraries:
sklearn: apply
xgboost: predict_leaf_index (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
lightgbm: predict_leaf_index
catboost: calc_leaf_index
Referring to the design of the above implementations, I propose the following API:
// set the output column for leaf indices on a fitted tree model
val model1: DecisionTreeClassificationModel = ...
model1.setLeafCol("leaves")
model1.transform(df)

// the param follows the usual getter/setter convention on ensemble models as well
val model2: GBTClassificationModel = ...
model2.getLeafCol
model2.transform(df)
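As a rough illustration of how the proposed leaf column could be used end to end in a Facebook-style pipeline (the column names, the leaf column's output type, and the downstream encoding step are all assumptions, not part of the final design; Spark 3.0+ single-column OneHotEncoder is assumed):

import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.DataFrame

val train: DataFrame = ...  // assumed training data with "features" and "label" columns

// Fit a tree model on the training set.
val dt = new DecisionTreeClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
val dtModel = dt.fit(train)

// Proposed API: ask the fitted model to also emit the leaf index hit by each row.
// For a single tree the column is assumed to hold one numeric index per row;
// for an ensemble (RF/GBT) it would hold one index per tree, and the encoding
// step below would need to handle each tree's indices separately.
dtModel.setLeafCol("leaf")
val withLeaves = dtModel.transform(train)

// One-hot encode the leaf indices and train a linear model on them,
// as in the Facebook paper.
val encoder = new OneHotEncoder().setInputCol("leaf").setOutputCol("leafFeatures")
val lr = new LogisticRegression().setFeaturesCol("leafFeatures").setLabelCol("label")
val lrModel = lr.fit(encoder.fit(withLeaves).transform(withLeaves))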
The detailed design doc: https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing
Issue Links
- Is contained by SPARK-14047 GBT improvement umbrella (Resolved)