Spark / SPARK-19979

[MLLIB] Multiple Estimators/Pipelines In CrossValidator


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      Update CrossValidator and TrainValidationSplit to accept multiple pipelines, each with its own parameter grid, in order to test different algorithms and/or better control tuning combinations. The change maintains a backwards-compatible API and reads legacy serialized objects.
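
      A hypothetical usage sketch of what the proposed API could look like; the plural setEstimators and setEstimatorParamGrids setters below are illustrative names only, not existing methods, while the current singular setters would remain for backwards compatibility:

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val rf = new RandomForestClassifier()

// One parameter grid per candidate pipeline.
val lrGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()
val rfGrid = new ParamGridBuilder().addGrid(rf.numTrees, Array(50, 100)).build()

// Hypothetical plural setters proposed by this issue: each estimator is paired with
// its own grid, and fit() returns the single best model across all of them.
val cv = new CrossValidator()
  .setEstimators(Array(new Pipeline().setStages(Array(lr)),
                       new Pipeline().setStages(Array(rf))))
  .setEstimatorParamGrids(Array(lrGrid, rfGrid))
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)

// val cvModel = cv.fit(training)  // best model across both pipelines and grids
{code}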

      The same result could be achieved with an external iterative approach: build the different pipelines, feed each into its own CrossValidator, take the best model from each of those CrossValidators, and finally pick the best among them. This is the initial approach I explored. It resulted in a lot of boilerplate code that should not need to exist if the API simply allowed arrays of estimators and their parameters.
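
      For comparison, a minimal sketch of that external iterative approach, assuming the candidate pipelines and their grids are built elsewhere:

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}
import org.apache.spark.sql.Dataset

// Fit one CrossValidator per candidate pipeline, then pick the overall winner.
def selectAcrossPipelines(
    candidates: Seq[(Pipeline, Array[ParamMap])],
    training: Dataset[_],
    evaluator: Evaluator,
    numFolds: Int = 3): CrossValidatorModel = {
  val models = candidates.map { case (pipeline, grid) =>
    new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(grid)
      .setEvaluator(evaluator)
      .setNumFolds(numFolds)
      .fit(training)        // each fit re-splits and re-caches its own k folds
  }
  // Compare each CrossValidator's best average metric to choose the final model.
  if (evaluator.isLargerBetter) models.maxBy(_.avgMetrics.max)
  else models.minBy(_.avgMetrics.min)
}
{code}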

      A couple of advantages of this implementation come from keeping the functional interface of the CrossValidator:

      1. The caching of the folds is better utilized. With an external iterative approach, a new set of k folds is created for each CrossValidator fit and discarded after each run. In this implementation a single set of k folds is created and cached once for all of the pipelines (see the sketch after this list).

      2. A potential advantage of this implementation is future parallelization of the pipelines within the CrossValidator. It is of course possible to handle the parallelization outside of the CrossValidator here too; however, there is already work in progress to parallelize the evaluation of the parameter grid, and that could be extended to multiple pipelines.

      Both of these behind-the-scenes optimizations are possible because the CrossValidator is given the data and the complete set of pipelines/estimators to evaluate up front, which allows the implementation details to be abstracted away.
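
      For illustration only, a rough sketch of the fold reuse described in point 1 (not the actual patch): the k folds are materialized and cached a single time and then shared by every estimator and parameter combination.

{code:scala}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{Dataset, SparkSession}

// Returns the average metric for every (estimator, paramMap) combination, computed
// on a single cached set of k folds shared by all estimators.
def crossValidateAll(
    spark: SparkSession,
    dataset: Dataset[_],
    estimators: Seq[(Estimator[_], Array[ParamMap])],
    evaluator: Evaluator,
    numFolds: Int,
    seed: Long): Seq[Double] = {
  val schema = dataset.schema
  // The k folds are created and cached once, instead of once per CrossValidator run.
  val splits = MLUtils.kFold(dataset.toDF.rdd, numFolds, seed).map { case (train, test) =>
    (spark.createDataFrame(train, schema).cache(),
     spark.createDataFrame(test, schema).cache())
  }
  estimators.flatMap { case (est, grid) =>
    grid.map { params =>
      val metrics = splits.map { case (train, test) =>
        val model = est.fit(train, params).asInstanceOf[Model[_]]
        evaluator.evaluate(model.transform(test), params)
      }
      metrics.sum / numFolds
    }
  }
}
{code}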

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: David Leifker (dleifker)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
