[SPARK-23704] PySpark access of individual trees in random forest is slow - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.2.1
Fix Version/s: None
Component/s: ML, PySpark
Labels:
- bulk-closed
Environment:

PySpark 2.2.1 / Windows 10

Description

Making predictions from a RandomForestClassifier model in PySpark is much faster than making predictions from an individual tree contained within the model's .trees attribute.

In fact, the model.transform call on its own, without any action, is more than 10x slower for an individual tree than for the random forest model.

See https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark for example with timing.
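
For readers without access to the linked question, the comparison can be sketched roughly as follows. This is not the reporter's original script. It assumes a local SparkSession and a tiny synthetic dataset, and `time_call` and `compare_forest_vs_tree` are hypothetical helper names introduced only for illustration.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_forest_vs_tree(num_trees=20):
    """Time model.transform (no action) for a forest and for its first tree."""
    # Imported lazily so time_call stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.linalg import Vectors

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("spark-23704-sketch")
        .getOrCreate()
    )
    try:
        # Tiny synthetic dataset: an assumption for illustration only.
        rows = [
            (float(i % 2), Vectors.dense([float(i), float(i % 3)]))
            for i in range(100)
        ]
        df = spark.createDataFrame(rows, ["label", "features"])

        model = RandomForestClassifier(numTrees=num_trees, seed=42).fit(df)
        first_tree = model.trees[0]  # a DecisionTreeClassificationModel

        # transform is lazy: these calls only build a plan, no Spark job runs.
        forest_seconds = time_call(lambda: model.transform(df))
        tree_seconds = time_call(lambda: first_tree.transform(df))
        return forest_seconds, tree_seconds
    finally:
        spark.stop()
```

Calling `compare_forest_vs_tree()` returns the two timings as a `(forest_seconds, tree_seconds)` tuple. The absolute numbers depend on the machine and Spark version, so only the ratio between them is of interest here.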

Ideally:

- Getting a prediction from a single tree should be comparable to, or faster than, getting predictions from the whole forest.
- Getting the predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest.

Issue Links

Is contained by

SPARK-14046 RandomForest improvement umbrella

Resolved

People

Assignee: Unassigned

Reporter: Julian King

Votes: 0

Watchers: 2

Dates

Created: 16/Mar/18 00:56

Updated: 08/Oct/19 05:43

Resolved: 08/Oct/19 05:43