Spark / SPARK-19422

Cache input data in algorithms

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

      Currently, some algorithms cache the input dataset themselves if it is not already cached (that is, if its storage level is StorageLevel.NONE):
      FeedForwardTrainer, LogisticRegression, OneVsRest, KMeans, AFTSurvivalRegression, IsotonicRegression, LinearRegression with a non-WLS solver
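
      The check described above can be sketched roughly as follows. This is a minimal illustration, not the code used inside any of the listed algorithms: the object name CacheIfNeeded and the helper withCaching are hypothetical, and the choice of StorageLevel.MEMORY_AND_DISK is an assumption. Only Dataset.storageLevel, persist, and unpersist are existing Spark APIs.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical helper (not part of Spark): runs `body` with the dataset
// persisted, but only takes over persistence if the caller has not cached it.
object CacheIfNeeded {
  def withCaching[T, R](dataset: Dataset[T])(body: Dataset[T] => R): R = {
    // StorageLevel.NONE means the caller has not persisted the dataset.
    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    // Assumed storage level for the temporary cache.
    if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      body(dataset)
    } finally {
      // Release only what this helper cached; leave a caller's cache alone.
      if (handlePersistence) dataset.unpersist()
    }
  }
}
```

      A caller might wrap a multi-pass computation as CacheIfNeeded.withCaching(training) { data => estimator.fit(data) }, where training and estimator stand for the caller's own dataset and estimator. If the dataset was already cached by the user, the helper neither re-persists nor unpersists it.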

      It may be reasonable to cache the input in other algorithms as well:
      DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LinearSVC
      BisectingKMeans, GaussianMixture, LDA
      DecisionTreeRegressor, GBTRegressor, GeneralizedLinearRegression with IRLS solver, RandomForestRegressor

      NaiveBayes is not included since it makes only one pass over the data.
      MultilayerPerceptronClassifier is not included since its data is already cached in FeedForwardTrainer.train.

              People

              • Assignee:
                Unassigned
                Reporter:
                podongfeng zhengruifeng
              • Votes:
                0
                Watchers:
                5
