[SPARK-3728] RandomForest: Learn models too large to store in memory - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: MLlib
Labels:
- bulk-closed

Description

Proposal: Write trees to disk as they are learned.

RandomForest currently schedules nodes with a FIFO queue, so it trains all trees at once in breadth-first order. Using a LIFO queue (a stack) instead would push the code to finish one tree before moving on to the next. Each tree could then be written to disk as soon as it is learned.
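The effect of the queue discipline can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the MLlib implementation: every tree is a complete binary tree, nodes are processed one at a time (the real code handles groups of nodes per pass), and `simulate` and its parameters are hypothetical names introduced here. The queue starts with the root of every tree, and the only difference between the two runs is whether nodes are taken from the front (first-in, first-out) or from the back (last-in, first-out, i.e. a stack).

```python
from collections import deque


def simulate(num_trees, depth, lifo):
    """Return (max trees in progress at once, order in which trees finish)."""
    # A node is (tree_id, level). The queue starts with the root of every tree.
    queue = deque((t, 0) for t in range(num_trees))
    # A complete binary tree with levels 0..depth has 2^(depth+1) - 1 nodes.
    remaining = {t: 2 ** (depth + 1) - 1 for t in range(num_trees)}
    started = set()
    finished = []
    max_open = 0
    while queue:
        tree, level = queue.pop() if lifo else queue.popleft()
        started.add(tree)
        # Trees that have been started but not yet finished must be held in memory.
        max_open = max(max_open, len(started) - len(finished))
        remaining[tree] -= 1
        if level < depth:
            # "Splitting" a node enqueues its two children.
            queue.append((tree, level + 1))
            queue.append((tree, level + 1))
        if remaining[tree] == 0:
            finished.append(tree)  # the point at which a tree could be written to disk
    return max_open, finished


print(simulate(3, 2, lifo=False))  # (3, [0, 1, 2]): all trees open at once
print(simulate(3, 2, lifo=True))   # (1, [2, 1, 0]): one tree open at a time
```

With the first-in, first-out order no tree finishes until every tree has reached its last level, so all of them stay in memory. With the stack, at most one tree is incomplete at any time.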

Note: It would also be possible to write nodes to disk as they are learned while keeping the FIFO queue, once the example-to-node mapping is cached [JIRA]. The Sequoia Forest package does this. However, learning trees one at a time could still be useful, because it would make it possible to support future functionality such as early stopping (training fewer trees than originally requested).
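To show how progressive learning and early stopping could fit together, here is a minimal sketch. It is not an existing Spark API: `train_progressively`, `train_tree`, `evaluate`, and `patience` are all hypothetical names, and the stopping rule (stop after `patience` consecutive trees without a better validation score) is just one possible choice. Each tree is written to disk as soon as it is trained, so only one tree needs to be in memory at a time.

```python
import json
import os
import tempfile


def train_progressively(train_tree, evaluate, max_trees, out_dir, patience=2):
    """Train up to max_trees trees, saving each one, and stop when the score stalls."""
    best_score = float("-inf")
    stale = 0
    paths = []
    for i in range(max_trees):
        tree = train_tree(i)  # hypothetical: returns a JSON-serializable tree
        path = os.path.join(out_dir, f"tree_{i}.json")
        with open(path, "w") as f:
            json.dump(tree, f)  # once written, the tree need not stay in memory
        paths.append(path)
        score = evaluate(paths)  # hypothetical: validation score of the forest so far
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping: fewer trees than max_trees were trained
    return paths


# Usage with stand-in callables: the score stops improving after the second tree.
out_dir = tempfile.mkdtemp()
scores = [0.5, 0.6, 0.6, 0.6, 0.7]
paths = train_progressively(
    lambda i: {"tree_id": i},
    lambda saved: scores[len(saved) - 1],
    max_trees=5,
    out_dir=out_dir,
)
print(len(paths))  # 4
```

Training all trees breadth-first would not allow this kind of loop, because no tree is complete until the very end.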

Issue Links

Is contained by: SPARK-14046 RandomForest improvement umbrella (Resolved)

Relates to: SPARK-13434 Reduce Spark RandomForest memory footprint (Resolved)

People

Assignee: Unassigned

Reporter: Joseph K. Bradley

Votes: 0

Watchers: 13

Dates

Created: 29/Sep/14 18:41

Updated: 21/May/19 04:17

Resolved: 21/May/19 04:17