SPARK-13677

Support Tree-Based Feature Transformation for ML

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels:
      None

      Description

      It would be useful to support RF and GBT models for feature transformation.
      First, fit an ensemble of trees (RF, GBT, or any other TreeEnsembleModel) on the training set. Each leaf of each tree in the ensemble is then assigned a fixed, arbitrary feature index in a new feature space, and the leaf indices that an instance falls into are one-hot encoded.
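      The encoding step described above can be sketched in plain Scala. This is only an illustration under stated assumptions: the Node, Leaf, Split, and LeafEncoding names are hypothetical and are not Spark's internal tree classes, and leaf ids are assumed to run from 0 to numLeaves - 1 within each tree.

```scala
// Hypothetical minimal tree representation (not Spark's internal Node classes).
sealed trait Node
final case class Leaf(id: Int) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object LeafEncoding {
  // Walk one tree and return the id of the leaf the sample lands in.
  @annotation.tailrec
  def leafId(node: Node, x: Array[Double]): Int = node match {
    case Leaf(id) => id
    case Split(f, t, l, r) => if (x(f) <= t) leafId(l, x) else leafId(r, x)
  }

  // Count the leaves of one tree.
  def numLeaves(node: Node): Int = node match {
    case Leaf(_) => 1
    case Split(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  // One-hot encode the leaves hit by x. Each tree owns a contiguous block of
  // slots in the output, so exactly one slot per tree is set to 1.0.
  def encode(trees: Seq[Node], x: Array[Double]): Array[Double] = {
    val offsets = trees.map(numLeaves).scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    trees.zip(offsets).foreach { case (tree, offset) =>
      out(offset + leafId(tree, x)) = 1.0
    }
    out
  }
}
```

      With a two-leaf tree and a three-leaf tree, the encoded vector has five slots, and each sample activates one slot in the first two and one in the last three.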

      This method was first introduced by Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf) and is implemented in several widely used libraries:

      • sklearn: apply
      • xgboost: [predict_leaf_index|https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]
      • lightgbm: predict_leaf_index
      • catboost: calc_leaf_index

      Referring to the design of the implementations above, I propose the following API:

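      val model1: DecisionTreeClassificationModel = ...

      // Set the name of the output column that receives the predicted leaf index
      model1.setLeafCol("leaves")
      model1.transform(df)

      val model2: GBTClassificationModel = ...

      // Read back the configured leaf column name
      model2.getLeafCol
      model2.transform(df)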
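      An end-to-end use of the proposed leafCol parameter might look like the sketch below. It assumes Spark 3.0.0 or later on the classpath (the fix version of this issue) and that setLeafCol is also available on the estimator; the LeafColSketch object, the toy data, and the "leaves" column name are made up for illustration.

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LeafColSketch {
  // Fits a tiny GBT model and returns the column names of the transformed data.
  def run(): Array[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("leaf-col-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy binary-classification data, for illustration only.
    val df = Seq(
      (0.0, Vectors.dense(0.0, 1.0)),
      (0.0, Vectors.dense(0.1, 0.9)),
      (1.0, Vectors.dense(1.0, 0.0)),
      (1.0, Vectors.dense(0.9, 0.2))
    ).toDF("label", "features")

    // Assumption: setting leafCol to a non-empty name makes transform()
    // append a column holding the predicted leaf index for each tree.
    val model = new GBTClassifier()
      .setMaxIter(3)
      .setMaxDepth(2)
      .setLeafCol("leaves")
      .fit(df)

    val out = model.transform(df)
    out.select("features", "prediction", "leaves").show(false)

    val cols = out.columns
    spark.stop()
    cols
  }
}
```

      The leaf indices in the "leaves" column would still need a one-hot encoding step before being fed to a downstream linear model.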

       

       The detailed design doc: https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing

              People

              • Assignee: podongfeng zhengruifeng
              • Reporter: podongfeng zhengruifeng
              • Votes: 1
              • Watchers: 7
