Spark / SPARK-16957

Use weighted midpoints for split values.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: MLlib
    • Labels:
      None

      Description

      We should use weighted split points rather than the actual binned values of a continuous feature. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as x <= 0.0 and x > 0.0, so the threshold sits exactly on an observed value. For any real data with some smoothness, this is asymptotically worse than GBM's approach. The split point should instead be a weighted point between the two values of the "innermost" feature bins; e.g., if there are 30 points with x = 0 and 10 points with x = 1, the above split should be at 0.75.
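A minimal sketch of the arithmetic in the example above, under one reading of "weighted": the threshold sits between the two adjacent values, at a position given by the lower bin's share of the combined count. This reproduces 0.75 for 30 points at x = 0 and 10 points at x = 1. The function name `weighted_split` and this particular weighting are assumptions made for illustration, not Spark's implementation.

```python
def weighted_split(lo_value, lo_count, hi_value, hi_count):
    """Threshold between two adjacent distinct feature values.

    Assumed weighting (hypothetical, chosen to match the 0.75 example):
    move from lo_value toward hi_value by the fraction of points that
    sit in the lower bin.
    """
    if lo_count <= 0 or hi_count <= 0:
        raise ValueError("both bins must contain at least one point")
    total = lo_count + hi_count
    return lo_value + (hi_value - lo_value) * lo_count / total


# 30 points at x = 0 and 10 points at x = 1
print(weighted_split(0.0, 30, 1.0, 10))  # 0.75
```

With equal counts in both bins, the same function returns the plain midpoint of the two values.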

      Example:

      +--------+--------+-----+-----+
      |feature0|feature1|label|count|
      +--------+--------+-----+-----+
      |     0.0|     0.0|  0.0|   23|
      |     1.0|     0.0|  0.0|    2|
      |     0.0|     0.0|  1.0|    2|
      |     0.0|     1.0|  0.0|    7|
      |     1.0|     0.0|  1.0|   23|
      |     0.0|     1.0|  1.0|   18|
      |     1.0|     1.0|  1.0|    7|
      |     1.0|     1.0|  0.0|   18|
      +--------+--------+-----+-----+
      
      DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
        If (feature 0 <= 0.0)
         If (feature 1 <= 0.0)
          Predict: -0.56
         Else (feature 1 > 0.0)
          Predict: 0.29333333333333333
        Else (feature 0 > 0.0)
         If (feature 1 <= 0.0)
          Predict: 0.56
         Else (feature 1 > 0.0)
          Predict: -0.29333333333333333
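
To see where the thresholds in this example would land under a count-weighted scheme, the per-value counts can be totalled from the table: each feature has 50 rows at 0.0 and 50 rows at 1.0. The sketch below tallies them and applies a count-weighted point between the two values. The `rows` list transcribes the table; `value_counts`, `weighted_split`, and the weighting itself are assumptions for illustration rather than Spark's code.

```python
from collections import Counter

# (feature0, feature1, label, count), transcribed from the table above
rows = [
    (0.0, 0.0, 0.0, 23),
    (1.0, 0.0, 0.0, 2),
    (0.0, 0.0, 1.0, 2),
    (0.0, 1.0, 0.0, 7),
    (1.0, 0.0, 1.0, 23),
    (0.0, 1.0, 1.0, 18),
    (1.0, 1.0, 1.0, 7),
    (1.0, 1.0, 0.0, 18),
]


def value_counts(rows, feature_index):
    """Sorted (value, total count) pairs for one feature column."""
    counts = Counter()
    for row in rows:
        counts[row[feature_index]] += row[3]
    return sorted(counts.items())


def weighted_split(lo, hi):
    """Assumed weighting: shift from the lower value by the lower bin's share."""
    (lo_value, lo_count), (hi_value, hi_count) = lo, hi
    return lo_value + (hi_value - lo_value) * lo_count / (lo_count + hi_count)


for feature_index in (0, 1):
    lo, hi = value_counts(rows, feature_index)
    print(f"feature {feature_index}: {lo}, {hi} -> split at {weighted_split(lo, hi)}")
```

Because the counts are balanced, both features get a threshold of 0.5 here, in place of the 0.0 shown in the tree above.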
      

              People

              • Assignee: facai Yan Facai (颜发才)
              • Reporter: vlad.feinberg Vladimir Feinberg