Spark / SPARK-21535

Reduce memory requirement for CrossValidator and TrainValidationSplit


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      CrossValidator and TrainValidationSplit both use

      models = est.fit(trainingDataset, epm) 

      to fit the models, where epm is an Array[ParamMap].

      Even though the training process is sequential, the current implementation consumes extra driver memory to hold all of the trained models, which is unnecessary and often leads to out-of-memory errors in both CrossValidator and TrainValidationSplit. My proposal is to optimize the training loop so that each model can be collected by the GC once it has been evaluated, avoiding the unnecessary OOM exceptions.

      E.g. when the grid search space contains 12 parameter combinations, the old implementation needs to hold all 12 trained models in driver memory at the same time, while the new implementation only needs to hold 1 trained model at a time; each previous model becomes unreachable and can be reclaimed by the GC.
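      The memory difference described above can be sketched in a short, language-agnostic example (this is not Spark's actual code; `fit`, `evaluate`, and `select_best` are hypothetical stand-ins for `est.fit`, the evaluator, and the tuning loop):

```python
def fit(param):
    # Stand-in for an expensive trained model: a large object per param.
    return {"param": param, "weights": [0.0] * 1000}

def evaluate(model):
    # Stand-in metric; here, smaller params score higher.
    return -model["param"]

def select_best(param_grid):
    # One-model-at-a-time loop: only the current model is referenced,
    # so each previous model is unreachable and eligible for GC.
    # The old pattern was `models = [fit(p) for p in param_grid]`,
    # which keeps every model alive simultaneously.
    best_param, best_metric = None, float("-inf")
    for param in param_grid:
        model = fit(param)
        metric = evaluate(model)
        if metric > best_metric:
            best_param, best_metric = param, metric
    return best_param

print(select_best(range(12)))  # grid of 12 candidates, one model live at a time
```

      With a grid of 12 candidates, the loop's peak model footprint is one model rather than twelve; the trade-off is the same either way since training is sequential in both patterns.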

      Attachments

        Issue Links

          Activity

            People

              Assignee: Unassigned
              Reporter: yuhao yang
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: