Spark / SPARK-3728

RandomForest: Learn models too large to store in memory

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

      Description

      Proposal: Write trees to disk as they are learned.

      RandomForest currently uses a FIFO queue, so it trains all trees at once via breadth-first search. Using a FILO queue (that is, a stack) would encourage the code to finish one tree before moving on to new ones. Each tree could then be written to disk as soon as it is learned.
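
To illustrate the difference in ordering, here is a minimal sketch in Python. It is not the MLlib implementation: the function `training_order` is hypothetical, it processes one node at a time, and it assumes every tree is a full binary tree of a fixed depth. It only shows how the queue discipline affects when trees complete and how many unfinished trees exist at once.

```python
from collections import deque


def training_order(num_trees, max_depth, lifo):
    """Simulate node scheduling for a forest (hypothetical helper).

    Returns (finished, peak_open): the order in which trees complete, and
    the largest number of started-but-unfinished trees seen at any time.
    """
    # Each work item is (tree_id, depth); start with every tree's root.
    queue = deque((t, 0) for t in range(num_trees))
    pending = {t: 1 for t in range(num_trees)}  # unprocessed nodes per tree
    started = set()
    finished = []
    peak_open = 0

    while queue:
        # FILO pops the most recently added node; FIFO pops the oldest.
        tree, depth = queue.pop() if lifo else queue.popleft()
        started.add(tree)
        pending[tree] -= 1

        # Assumption: every node above max_depth splits into two children.
        if depth < max_depth:
            queue.append((tree, depth + 1))
            queue.append((tree, depth + 1))
            pending[tree] += 2

        peak_open = max(peak_open, len(started) - len(finished))

        if pending[tree] == 0:
            # The tree is complete; this is where it could be written to disk.
            finished.append(tree)

    return finished, peak_open


fifo_result = training_order(num_trees=3, max_depth=2, lifo=False)
filo_result = training_order(num_trees=3, max_depth=2, lifo=True)
print(fifo_result)  # ([0, 1, 2], 3): all three trees are open at once
print(filo_result)  # ([2, 1, 0], 1): only one tree is open at a time
```

In this simplified model, the FIFO order keeps every tree unfinished until the last level is processed, while the FILO order completes each tree before starting the next, so at most one partially built tree needs to be held at a time.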

      Note: It would also be possible to write nodes to disk as they are learned using a FIFO queue, once the example-to-node mapping is cached [JIRA]. The Sequoia Forest package does this. However, it could be useful to learn trees progressively, so that future functionality such as early stopping (training fewer trees than originally requested) could be supported.

              People

              • Assignee: Unassigned
              • Reporter: Joseph K. Bradley (josephkb)
              • Votes: 0
              • Watchers: 14
