Spark / SPARK-13677

Support Tree-Based Feature Transformation for ML

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels: None

    Description

      It would be nice to be able to use RF and GBT models for feature transformation:
      First, fit an ensemble of trees (RF, GBT, or any other TreeEnsembleModel) on the training set. Each leaf of each tree in the ensemble is then assigned a fixed, arbitrary feature index in a new feature space, and the leaf indices that an instance falls into are encoded in a one-hot fashion.
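      To make the encoding concrete, here is a minimal sketch in plain Scala that does not use Spark. The names `Node`, `Leaf`, `Split`, `leafId`, and `encode` are hypothetical and exist only for this illustration. It assumes the leaves of each tree are numbered 0 to n-1, and that the one-hot blocks of the trees are concatenated in tree order.

```scala
import scala.annotation.tailrec

object LeafEncodingSketch {
  // Toy binary tree (hypothetical types, not Spark classes).
  // Assumption: leaf ids within one tree are 0 until numLeaves.
  sealed trait Node
  final case class Leaf(id: Int) extends Node
  final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Walk from the root to the leaf that the feature vector x falls into.
  @tailrec
  def leafId(node: Node, x: Array[Double]): Int = node match {
    case Leaf(id) => id
    case Split(f, t, l, r) => if (x(f) <= t) leafId(l, x) else leafId(r, x)
  }

  // Count the leaves of a tree.
  def numLeaves(node: Node): Int = node match {
    case Leaf(_) => 1
    case Split(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  // One-hot encode the leaf reached in each tree. Each tree owns a block of
  // the output vector; the block starts at the running sum of leaf counts.
  def encode(trees: Seq[Node], x: Array[Double]): Array[Double] = {
    val offsets = trees.map(numLeaves).scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    trees.zip(offsets).foreach { case (tree, offset) =>
      out(offset + leafId(tree, x)) = 1.0
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val tree1 = Split(0, 0.5, Leaf(0), Split(1, 2.0, Leaf(1), Leaf(2)))
    val tree2 = Split(1, 1.0, Leaf(0), Leaf(1))
    // The instance reaches leaf 1 of tree1 and leaf 1 of tree2.
    println(encode(Seq(tree1, tree2), Array(0.9, 1.5)).mkString("[", ", ", "]"))
    // prints [0.0, 1.0, 0.0, 0.0, 1.0]
  }
}
```

      With an ensemble of T trees, every encoded vector has exactly T non-zero entries, so a sparse representation would likely be preferable in practice.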

      This method was first introduced by Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf) and is implemented in several popular libraries:

      sklearn: apply

      xgboost: predict_leaf_index (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)

      lightgbm: predict_leaf_index

      catboost: calc_leaf_index

      Referring to the design of the implementations above, I propose the following API:

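      // A fitted single-tree model: name the output column for leaf indices, then transform.
      val model1: DecisionTreeClassificationModel = ...
      model1.setLeafCol("leaves")
      model1.transform(df)

      // A fitted ensemble model: read the configured leaf column name, then transform.
      val model2: GBTClassificationModel = ...
      model2.getLeafCol
      model2.transform(df)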

      Detailed design doc: https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing

            People

              Assignee: Ruifeng Zheng (podongfeng)
              Reporter: Ruifeng Zheng (podongfeng)
              Votes: 1
              Watchers: 7
