Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

      This is a JIRA for exploring accuracy improvements for Random Forests.

      Background

      Initial exploration was based on reports of poor accuracy from http://datascience.la/benchmarking-random-forest-implementations/

      Essentially, Spark 1.2 showed poor accuracy relative to other libraries for training set sizes of 1M and 10M.

      Initial improvements

      The biggest issue was that the metric being used was AUC, and Spark 1.2 was using hard predictions rather than class probabilities. This was fixed in SPARK-9528, which brought Spark up to accuracy parity with scikit-learn, Vowpal Wabbit, and R for the 1M training set size.
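To illustrate why hard predictions hurt AUC, here is a small self-contained sketch (plain Python, not Spark code; the labels and probabilities are made up for illustration). Probability scores rank examples finely, while thresholded 0/1 predictions collapse the ranking into two tied groups, which lowers AUC:

```python
# Minimal AUC via pairwise ranking: the fraction of (positive, negative)
# pairs where the positive is scored higher, counting ties as half.
# Illustrative only; this is not Spark's implementation.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
probs  = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]        # hypothetical class-1 probabilities
hard   = [1 if p >= 0.5 else 0 for p in probs]  # thresholded hard predictions

# probs preserve the full ranking; hard predictions introduce ties
# within each predicted class, so auc(hard, ...) <= auc(probs, ...)
```

With these made-up numbers, `auc(probs, labels)` is 8/9 while `auc(hard, labels)` drops to 6/9, which is the same kind of gap the SPARK-9528 fix removed.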

      Remaining issues

      For training set size 10M, Spark does not yet match the AUC of the other 2 libraries benchmarked (H2O and xgboost).

      Note that, on 1M instances, these 2 libraries also show better results than scikit-learn, VW, and R. I'm not too familiar with the H2O implementation and how it differs, but xgboost is a very different algorithm, so it's not surprising it has different behavior.

      My explorations

      I've run Spark on the test set of 10M instances. (Note that the benchmark linked above used somewhat different settings for the different algorithms, including gini vs. entropy impurity and limits on splitting nodes, but those settings are actually not that important for this problem.)

      I've tried adjusting:

      • maxDepth: Past depth 20, going deeper does not seem to matter.
      • maxBins: I've gone up to 500, but this too does not seem to matter. However, this is hard to verify, since slight differences in discretization could become significant in a large tree.
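To make the maxBins point concrete, here is a toy sketch (plain Python, illustrative only, not Spark's exact logic) of how quantile-style discretization of a continuous feature produces different candidate split thresholds at different bin counts, so individual split points can shift slightly between runs:

```python
# Toy equal-frequency binning: candidate split thresholds are taken at the
# quantile boundaries of the sorted feature values.
def split_candidates(values, max_bins):
    xs = sorted(values)
    step = len(xs) / max_bins
    # one boundary per interior bin edge (extremes excluded)
    return sorted({xs[int(i * step)] for i in range(1, max_bins)})

feature = [x * 0.1 for x in range(100)]   # made-up continuous feature

coarse = split_candidates(feature, 4)     # 3 candidate thresholds
fine = split_candidates(feature, 10)      # 9 candidate thresholds
# with more bins the candidate thresholds are denser; a split chosen deep in
# a large tree can therefore land on a slightly different value, which is why
# the effect of maxBins is hard to verify directly
```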

      Current questions

      • H2O: It would be good to understand how this implementation differs from standard RF implementations (in R, VW, scikit-learn, and Spark).
      • xgboost: There's a JIRA for it: SPARK-8547. It would be great to see the Spark package linked from that JIRA tested vs. MLlib on the benchmark data (or other data). From what I've heard/read, xgboost is sometimes better, sometimes worse in accuracy (but of course faster with more localized training).
      • Based on the above explorations, are there changes we should make to Spark RFs?

        Issue Links

          Activity

          yinxusen Xusen Yin added a comment - edited

          Joseph K. Bradley Rahul Tanwani Here is what I found:

          1. Dataset preprocessing
          In this dataset, all columns except DepTime and Distance are categorical features. The easiest way to transform the data into LabeledPoint form is RFormula. However, RFormula is not suitable here, because it produces a dataset with a different shape than the original: RFormula uses a one-hot encoder, so it expands the original dataset into thousands of columns.

          This brings two drawbacks:
          a. The volume of the dataset is expanded, which may hurt performance.
          b. The one-hot encoder splits one column into as many new columns as its cardinality, and since Random Forest cannot take groups of features into consideration, this may hurt accuracy.

          RFormula also treats DepTime and Distance as categorical features, which adds even more unnecessary columns and reduces the accuracy a step further, since DepTime and Distance are the two most important features for this task.

          By contrast, H2O uses the original dataset without further preprocessing.
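To illustrate the width blow-up, here is a small sketch (plain Python; the carrier values are hypothetical, and Spark's RFormula/OneHotEncoder actually emit sparse vectors rather than plain lists) of what one-hot encoding does to a categorical column:

```python
# One-hot encode a categorical column: one output column per distinct value.
def one_hot(column):
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]

# hypothetical UniqueCarrier-like column with cardinality 4
carriers = ["AA", "UA", "DL", "AA", "WN"]
encoded = one_hot(carriers)
# one input column became len(set(carriers)) == 4 columns; a column like
# Origin, with hundreds of airports, would expand into hundreds of columns,
# and the tree can no longer split on the original categorical as one feature
```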

          2. Spark RandomForest can also get a good result
          In my experiment, Spark RF with 10 trees, maxDepth 20, and 1M training instances gets AUC 0.744321364. With the same settings, H2O gets AUC 0.695598. For detailed results, see https://docs.google.com/document/d/1l7SGFtUkZeM4WEXFlpc08pfBfnu6d25KQFToFHC6CTo/edit?usp=sharing
          Note that the "NA"s mean Spark hit OOM on my laptop.

          3. OOM in Spark Random Forest
          In a single-machine environment, Spark RF is slower than H2O. Worse, OOM occurs frequently on Spark with larger bins, more trees, and larger maxDepth. The reason is that Spark allocates new Double arrays quite often inside each partition.

          Say in one partition of our dataset, Spark creates numNodes Double arrays, each of length numFeatures * numBins * statsSize. On a single machine with 16 partitions, we may allocate O(numPartitions * numNodes * numFeatures * numBins * statsSize) Doubles in total. From my experiments, the maxMemoryInMB parameter is barely useful here. It would be better to use multiple servers and spread those tasks out.
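A back-of-the-envelope version of that estimate (plain Python; the parameter values below are made-up illustrations, not measurements from the benchmark):

```python
# Rough bound on histogram memory across partitions, following the
# O(numPartitions * numNodes * numFeatures * numBins * statsSize) argument.
def histogram_doubles(num_partitions, num_nodes, num_features, num_bins, stats_size):
    return num_partitions * num_nodes * num_features * num_bins * stats_size

# hypothetical single-machine setting: 16 partitions, 128 frontier nodes in
# the current BFS layer, 8 features, 500 bins, 2 stats per bin (binary task)
doubles = histogram_doubles(16, 128, 8, 500, 2)
gib = doubles * 8 / 2**30   # 8 bytes per Double
# even at these made-up settings the layer's histograms alone take ~0.12 GiB,
# and the node count per layer grows quickly as the trees get deeper
```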

          Spark trains the random forest in a BFS manner, i.e. the 1st layer of all trees, then the 2nd layer of all trees, while H2O trains tree by tree and, inside each tree, layer by layer. H2O also uses smaller arrays than Spark to collect histograms. It uses Java Fork/Join to split tasks; inside each task it generates Double arrays of size numNodes * numFeatures * numBins, then merges them into a shared DHistogram in each process. (I am not entirely sure about this process, since the DRF code in H2O is more complicated than Spark's and lacks detailed comments.)
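The difference in training order can be sketched as two loop structures (plain Python pseudostructure; `train_layer` is a hypothetical placeholder for the per-layer histogram and split-finding work):

```python
# Spark-style: breadth-first across ALL trees at once. Each pass over the
# data serves every tree's current layer, but histograms for all trees'
# frontier nodes must be held in memory simultaneously.
def train_forest_bfs(num_trees, max_depth, train_layer):
    for depth in range(max_depth):
        # one pass handles layer `depth` of every tree
        train_layer([(t, depth) for t in range(num_trees)])

# H2O-style: tree by tree, and layer by layer inside each tree. More passes
# over the data, but only one tree's frontier histograms are alive at a time.
def train_forest_tree_by_tree(num_trees, max_depth, train_layer):
    for t in range(num_trees):
        for depth in range(max_depth):
            train_layer([(t, depth)])
```

This is why the BFS scheme trades memory (all trees' frontiers at once) for fewer passes over the data, consistent with the OOM behavior described above.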

          Besides, H2O has a MemoryManager to allocate arrays and stave off OOM as long as possible. Even so, H2O also crashed with OOM once on my laptop while training 500 trees with maxDepth 20 on the 10M dataset.

          tanwanirahul Rahul Tanwani added a comment -

          Xusen Yin Could you please share your findings? I'm interested to know how the H2O implementation is different from what we have in ml.

          Thanks!

          CodingCat Nan Zhu added a comment -

          FYI, we released a solution to integrate XGBoost with Spark directly

          http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html

          yinxusen Xusen Yin added a comment -

          I'd love to explore this.

          josephkb Joseph K. Bradley added a comment -

          Xusen Yin or holdenk Would either of you be interested in exploring this (or know others who would be)?


            People

            • Assignee:
              Unassigned
              Reporter:
              josephkb Joseph K. Bradley
            • Votes:
              2 Vote for this issue
              Watchers:
              17 Start watching this issue
