[SPARK-34591] Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct - ASF JIRA

Details

Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.0, 2.4.4, 3.1.1
Fix Version/s: None
Component/s: ML
Labels:
- pull-request-available
- pyspark

Description

History of the issue

~~SPARK-3159~~ implemented a method designed to reduce the computational burden for predictions from decision trees and random forests by pruning the tree after fitting. This is done in such a way that branches where child leaves all produce the same classification prediction are merged.

This was implemented via a PR: https://github.com/apache/spark/pull/20632

This feature is controllable by a "prune" parameter in the Scala version of the code, which is set to True as the default behaviour. However, this parameter is not exposed in the Pyspark API, resulting in the behaviour above:

Occurring always (despite the user may not wanting it to occur)
Not being documented in the ML documentation, leading to decision tree behavoiur that may be in conflict with what the user expects to happen

Why is this a problem?

Problem 1: Inaccurate probabilities

Because the decision to prune is based on the classification prediction from the tree (not the probability prediction from the node), this introduces additional bias compared to the situation where the pruning is not done. The impact here may be severe in some cases

Problem 2: Leads to completely unacceptable behaviours in some circumstances and for some hyper-parameters

My colleagues and I encountered this bug in a scenario where we could not get a decision tree classifier (or random forest classifier with a single tree) to split a single node, despite this being eminently supported by the data. This renders the decision trees and random forests complete unusable

Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and how they interact with the data

Small changes in the hyper-parameters should ideally produce small changes in the built trees. However, here we have found that small changes in the hyper-parameters lead to large and unpredictable changes in the resultant trees as a result of this pruning.

In principle, this high degree of instability means that re-training the same model, with the same hyper-parameter settings, on slightly different data may lead to large variations in the tree structure simply as a result of the pruning

Problem 4: The problems above are much worse for unbalanced data sets

Probability estimation on unbalanced data sets using trees should be supported, but the pruning method described will make this very difficult

Problem 5: This pruning method is a substantial variation from the description of the decision tree algorithm in the MLLib documents and is not described

This made it extremely confusing for us in working out why we were seeing certain behaviours - we had to trace back through all of the Spark detailed release notes to identify where the problem might.

Proposed solutions

Option 1 (much easier):

The proposed solution here is:

Set the default pruning behaviour to False rather than True, thereby bringing the default behaviour back into alignment with the documentation whilst avoiding the issues described above

Option 2 (more involved):

The proposed solution here is:

Leave the default pruning behaviour set to False
Expand the pyspark API to expose the pruning behaviour as a user-controllable option
Document the change to the API
Document the change to the tree building behaviour at appropriate points in the Spark ML and Spark MLLib documentation

We recommend that the default behaviour be set to False because this approach is not the generally understood approach for building decision trees, where pruning is decided a separate and user-controllable step.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

Reproducible example of Spark bug - no 2.pdf
08/Jun/21 12:37
423 kB
Julian King
Reproducible example of Spark bug.pdf
08/Jun/21 00:09
119 kB
Julian King

Issue Links

is caused by

SPARK-3159 Check for reducible DecisionTree

Resolved

Is contained by

SPARK-14045 DecisionTree improvement umbrella

Resolved

is related to

SPARK-14046 RandomForest improvement umbrella

Resolved

links to

[Github] Pull Request #32813 (CBribiescas)

GitHub Pull Request #32813

(1 links to)

Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates