Spark / SPARK-23704

PySpark access of individual trees in random forest is slow


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Environment: PySpark 2.2.1 / Windows 10

    Description

      Making predictions from a RandomForestClassifier model in PySpark is much faster than making predictions from an individual tree contained in the model's .trees attribute.

      In fact, the model.transform call without an action is more than 10x slower for an individual tree than for the random forest model.

      See https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark for an example with timings.

      Ideally:

      • Getting a prediction from a single tree should be comparable to or faster than getting predictions from the whole forest
      • Getting the predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest
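
      The comparison described above could be reproduced with a sketch along these lines. It is not the code from the linked Stack Overflow question: the helper names time_call and compare_forest_and_trees, the toy data, and the local[2] master are all assumptions made for illustration. As in the report, transform is called without triggering an action, so only the cost of building the lazy DataFrame is measured.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_forest_and_trees(n_rows=1000, num_trees=10):
    """Time transform() on a forest versus its individual trees (needs pyspark)."""
    # Imported lazily so the timing helper above works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.linalg import Vectors

    spark = (
        SparkSession.builder.master("local[2]").appName("tree-timing").getOrCreate()
    )

    # Hypothetical toy data: the label simply equals the first feature.
    rows = [
        (float(i % 2), Vectors.dense([float(i % 2), float(i % 7)]))
        for i in range(n_rows)
    ]
    df = spark.createDataFrame(rows, ["label", "features"])

    model = RandomForestClassifier(numTrees=num_trees, seed=42).fit(df)

    # transform() is lazy: no action is triggered here, matching the report.
    forest_s = time_call(lambda: model.transform(df))
    single_s = time_call(lambda: model.trees[0].transform(df))
    all_trees_s = time_call(lambda: [t.transform(df) for t in model.trees])

    spark.stop()
    return {"forest": forest_s, "single_tree": single_s, "all_trees": all_trees_s}
```

      Calling compare_forest_and_trees() returns the three timings in seconds. The absolute numbers will vary by machine and Spark version, so only the ratio between the forest and the single-tree timing is meaningful here.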


          People

            Assignee: Unassigned
            Reporter: Julian King (alpha137)
            Votes: 0
            Watchers: 2
