Spark / SPARK-23704

PySpark access of individual trees in random forest is slow

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Environment: PySpark 2.2.1 / Windows 10

    Description

      Making predictions from a RandomForestClassifier model in PySpark is much faster than making predictions from an individual tree contained in the model's .trees attribute.

      In fact, even without an action, the model.transform call on an individual tree is more than 10x slower than the same call on the random forest model.

      See https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark for an example with timings.

      Ideally:

      • Getting a prediction from a single tree should be comparable to or faster than getting predictions from the whole forest
      • Getting all the predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest
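
The comparison described above could be reproduced with something like the following. This is only a minimal sketch, not the code from the linked Stack Overflow question: the helper names `time_call` and `compare_transform_times` are hypothetical, and `model` is assumed to be a fitted random forest model exposing `.transform` and `.trees`. Only the lazy `transform` call is timed, with no action triggered, matching the observation in the description.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds of fn() over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_transform_times(model, df, repeats=3):
    """Time the lazy transform call on a forest and on each of its trees.

    Assumes `model` is a fitted random forest model with a `.trees`
    attribute, and `df` is a DataFrame the model can transform.
    No action (count, collect, ...) is run on the results.
    """
    forest_time = time_call(lambda: model.transform(df), repeats)
    tree_times = [
        time_call(lambda t=tree: t.transform(df), repeats)
        for tree in model.trees
    ]
    return forest_time, tree_times


# Hypothetical usage, assuming `train_df` has "features" and "label" columns:
#   from pyspark.ml.classification import RandomForestClassifier
#   model = RandomForestClassifier(numTrees=10).fit(train_df)
#   forest_time, tree_times = compare_transform_times(model, train_df)
#   print(forest_time, max(tree_times))
```

If the behaviour reported here holds, each entry of `tree_times` would be expected to be noticeably larger than `forest_time`, whereas the ideal outcome listed above is for them to be comparable or smaller.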

       

            People

              Assignee: Unassigned
              Reporter: Julian King (alpha137)