Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-3155

Support DecisionTree pruning

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Incomplete
    • None
    • None
    • MLlib

    Description

      Improvement: accuracy, computation

      Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning.

      Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed.
      A naive implementation prunes as follows:
      (1) Train a depth K tree using a training set.
      (2) Compute the optimal prediction at each node (including internal nodes) based on the training set.
      (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).)
      (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error.

      A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch.

      It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              josephkb Joseph K. Bradley
              Votes:
              2 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: