  1. Spark
  2. SPARK-16957

Use weighted midpoints for split values.


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: MLlib
    • Labels: None

    Description

      We should use weighted split points rather than the actual binned values of a continuous feature. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are currently selected as x <= 0.0 and x > 0.0, so the threshold sits exactly on an observed value. For any real data with some smoothness, this is asymptotically worse than GBM's approach. The split point should instead be a weighted combination of the two values of the "innermost" feature bins; e.g., if there are 30 rows with x = 0 and 10 rows with x = 1, the above split should be at 0.75.
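
      The arithmetic behind the 0.75 figure can be sketched as follows. This is only one reading of the description, not the code that was merged: the helper name `split_candidates` is hypothetical, and the weighting formula (each value weighted by the count of the opposite bin) is an assumption chosen because it reproduces the 30/10 example. The plain midpoint is shown alongside for comparison.

```python
from typing import Dict, List, Tuple


def split_candidates(value_counts: List[Tuple[float, int]]) -> List[Dict[str, float]]:
    """For each pair of adjacent distinct feature values, return three
    candidate thresholds: the lower observed value (the behavior the
    issue complains about), the plain midpoint, and a count-weighted point.

    Assumption: the weighted point weights each value by the count of the
    *other* bin, which is the reading that matches the 0.75 example.
    """
    ordered = sorted(value_counts)
    candidates = []
    for (lo, n_lo), (hi, n_hi) in zip(ordered, ordered[1:]):
        candidates.append({
            "lower_value": lo,
            "midpoint": (lo + hi) / 2.0,
            "weighted": (lo * n_hi + hi * n_lo) / (n_lo + n_hi),
        })
    return candidates


# 30 rows with x = 0 and 10 rows with x = 1, as in the description.
example = split_candidates([(0.0, 30), (1.0, 10)])
print(example)
```

      Under this assumption the threshold moves toward the less populated value, so an unseen x slightly above 0 still falls on the same side as the 30 rows at x = 0.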

      Example:

      +--------+--------+-----+-----+
      |feature0|feature1|label|count|
      +--------+--------+-----+-----+
      |     0.0|     0.0|  0.0|   23|
      |     1.0|     0.0|  0.0|    2|
      |     0.0|     0.0|  1.0|    2|
      |     0.0|     1.0|  0.0|    7|
      |     1.0|     0.0|  1.0|   23|
      |     0.0|     1.0|  1.0|   18|
      |     1.0|     1.0|  1.0|    7|
      |     1.0|     1.0|  0.0|   18|
      +--------+--------+-----+-----+
      
      DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
        If (feature 0 <= 0.0)
         If (feature 1 <= 0.0)
          Predict: -0.56
         Else (feature 1 > 0.0)
          Predict: 0.29333333333333333
        Else (feature 0 > 0.0)
         If (feature 1 <= 0.0)
          Predict: 0.56
         Else (feature 1 > 0.0)
          Predict: -0.29333333333333333
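
      As a rough check on what a weighted split point would do for this table, the sketch below aggregates the `count` column per feature value. The weighting formula (each value weighted by the count of the opposite bin) is an assumption that matches the 30/10 example in the description; the helper names are hypothetical and nothing here is Spark API.

```python
from collections import Counter

# Rows of the example table: (feature0, feature1, label, count).
rows = [
    (0.0, 0.0, 0.0, 23),
    (1.0, 0.0, 0.0, 2),
    (0.0, 0.0, 1.0, 2),
    (0.0, 1.0, 0.0, 7),
    (1.0, 0.0, 1.0, 23),
    (0.0, 1.0, 1.0, 18),
    (1.0, 1.0, 1.0, 7),
    (1.0, 1.0, 0.0, 18),
]


def value_counts(feature_index):
    """Total row count for each distinct value of one feature, sorted by value."""
    counts = Counter()
    for row in rows:
        counts[row[feature_index]] += row[3]
    return sorted(counts.items())


def weighted_split(lo, n_lo, hi, n_hi):
    """Assumed weighting: each value is weighted by the other bin's count."""
    return (lo * n_hi + hi * n_lo) / (n_lo + n_hi)


splits = {}
for idx in (0, 1):
    (lo, n_lo), (hi, n_hi) = value_counts(idx)
    splits[idx] = weighted_split(lo, n_lo, hi, n_hi)

print(splits)
```

      Both features take the values 0 and 1 in 50 rows each, so under this assumption the thresholds would land at 0.5 for both features instead of the 0.0 shown in the tree above.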
      

            People

              Assignee: Yan Facai (颜发才) (facai)
              Reporter: Vladimir Feinberg (vlad.feinberg)
              Votes: 0
              Watchers: 4
