Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: MLlib
    • Labels: None

      Description

      Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce.

      This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the "requires" links below for subtasks.

      Goals:

      • give intuitive structure to API, both for developers and for generated documentation
      • support meta-algorithms (e.g., boosting)
      • support generic functionality (e.g., evaluation; see the sketch after this list)
      • reduce code duplication across classes
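
      As a rough editorial illustration of the last two goals, here is a minimal sketch with simplified stand-in types (not the actual proposed Spark API): evaluation code written once against a shared model abstraction works for every classifier.

          // Illustrative stand-ins only, not the proposed Spark classes.
          trait ClassificationModel {
            def predict(features: Array[Double]): Double
          }

          object Evaluation {
            // Written once against the abstraction, so it works for any
            // classifier's model: this genericity is the point of the hierarchy.
            def accuracy(model: ClassificationModel,
                         data: Seq[(Array[Double], Double)]): Double = {
              val correct = data.count { case (features, label) =>
                model.predict(features) == label
              }
              correct.toDouble / data.size
            }
          }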

      Design doc for class hierarchy

        Issue Links

          Activity

          josephkb Joseph K. Bradley added a comment -

          I'm closing this and calling it fixed. This JIRA was scoped a bit too large, and I'm cutting it off to limit it to classification and regression abstractions. I've created a new umbrella SPARK-10817 under which we can do further work on abstractions.

          avulanov Alexander Ulanov added a comment -

          Joseph K. Bradley Hi, Joseph! Do you plan to add support for multivariate regression? I need this for the multi-layer perceptron, and a multivariate regression interface might be useful for other tasks as well. I've added an issue: https://issues.apache.org/jira/browse/SPARK-9120. I also wonder whether you plan to add integer array parameters: https://issues.apache.org/jira/browse/SPARK-9118. Both seem to be relatively easy to implement; the question is whether you plan to merge these features in the near future.

          josephkb Joseph K. Bradley added a comment -

          I would call it a sub-task, but we have not yet addressed it. Spark 1.4 won't include clustering in the Pipelines API, I'm afraid, but I'd like to get started adding it ASAP so it can hopefully be in 1.5. I'll make a JIRA and link it here.

          rajao Jao Rabary added a comment - edited

          Are unsupervised learning algorithms also covered by this standardization? I would like to use algorithms such as k-means with ml pipelines. How can one get started with that?

          josephkb Joseph K. Bradley added a comment -

          That's supported via the "rawPredictions" output column in the org.apache.spark.ml.classification.Classifier abstraction. As we add wrappers for spark.mllib algorithms in the spark.ml package, classifiers that support confidences can output this column. Classifiers which don't yet support confidences can be subclasses of Predictor instead of Classifier. But yes, it will be important to support more and more!
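
          (Editor's sketch, with simplified stand-in types rather than the actual spark.ml signatures, of how generic code could consume such per-class raw scores:)

          // Simplified stand-ins, not the real spark.ml signatures.
          trait ClassifierModel {
            def predict(features: Array[Double]): Double
          }

          // Models that can produce one raw score per class mix in this trait;
          // the predicted class is the argmax over the raw scores.
          trait HasRawPrediction extends ClassifierModel {
            def predictRaw(features: Array[Double]): Array[Double]
            override def predict(features: Array[Double]): Double =
              predictRaw(features).zipWithIndex.maxBy(_._1)._2.toDouble
          }

          object Confidence {
            // Generic code can ask any confidence-capable model for its top score.
            def topScore(model: HasRawPrediction, features: Array[Double]): Double =
              model.predictRaw(features).max
          }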

          derenrich Daniel Erenrich added a comment -

          One thing I'm interested in along these lines is a standard interface/method for getting confidence scores from models. Currently I cannot write code that generically accepts a model that can give me the probability that its prediction is correct. There are many use cases where you would want to handle such models, but there doesn't appear to be a standard way to get that information, even though many of the models already support this functionality in one way or another.

          josephkb Joseph K. Bradley added a comment -

          Using Vector types is better since they store values as Array[Double], which avoids creating an object for every value. If you're thinking about feature names/metadata, the Metadata capability in DataFrame will be able to handle metadata for each feature in Vector columns.
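
          (Editor's sketch of the boxing point, using a simplified stand-in rather than Spark's actual Vector classes:)

          // Simplified stand-in, not Spark's Vector hierarchy. Array[Double]
          // stores primitives contiguously, with no per-element object.
          object VectorSketch {
            final class DenseVector(val values: Array[Double]) {
              def apply(i: Int): Double = values(i) // unboxed primitive read
              def size: Int = values.length
            }

            val v = new DenseVector(Array(1.0, 0.0, 3.5))
            // By contrast, List[Double](1.0, 0.0, 3.5) boxes each element
            // as a java.lang.Double object.
          }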

          prudenko Peter Rudenko added a comment - edited

          For tree-based algorithms, I'm curious whether there would be a performance benefit (assuming a reimplementation of decision trees) from passing DataFrame columns directly rather than a single column with a vector type. E.g.:

          class GBT extends Estimator with HasInputCols

          val model = new GBT().setInputCols("col1", "col2", "col3", ...)


          and split the dataset using the DataFrame API.

          josephkb Joseph K. Bradley added a comment -

          I'm canceling my WIP PR for this since I have begun breaking that PR into smaller PRs.
          The WIP PR branch is in my ml-api branch.

          Here's the description of the WIP PR:

          This is a WIP effort to standardize abstractions and the developer API for prediction tasks (classification and regression) in the new ML api (org.apache.spark.ml).

          • Please comment on:
            • abstractions, class hierarchy
            • functionality required by each abstraction
            • naming of types and methods
            • ease of use for developers
            • ease of use for users migrating from org.apache.spark.mllib
          • Please ignore for now:
            • missing tests and examples
            • private/public API (I will make more things private to ml after writing tests and examples.)
            • style and other details
            • the many TODO items noted in the code

          Please refer to https://issues.apache.org/jira/browse/SPARK-3702 for some discussion on design, and this design doc for major design decisions.

          This is not intended to cover all algorithms; e.g., one big missing item is porting the GeneralizedLinearModel class to the new API. But it hopefully lays a fair amount of groundwork.

          I have included a limited number of concrete classes in this WIP PR, for purposes of illustration (a simplified sketch of the hierarchy's shape follows this list):

          • LogisticRegression (edited, to show effects of abstract classes)
          • NaiveBayes (simple to show ease of use for developers)
          • AdaBoost (demonstration of meta-algorithms taking advantage of abstractions)
            • (Note discussion of strong vs. weak types for ensemble methods in the design doc.)
            • This implementation is very incomplete but illustrates using the abstractions.
          • LinearRegression (example of Regressor, for completeness)
          • evaluators (to provide default evaluators in the class hierarchy)
          • IterativeSolver and IterativeEstimator (to expose iterative algorithms)
          • LabeledPoint (Q: Should this include an instance weight?)
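
          To make the hierarchy's shape concrete, a much-simplified editorial sketch (stand-in names and signatures; the design doc has the actual proposal):

          // Simplified, illustrative stand-ins; see the design doc for the real API.
          case class LabeledPoint(label: Double, features: Array[Double])

          abstract class Model {
            def predict(features: Array[Double]): Double
          }

          abstract class Estimator[M <: Model] {
            def fit(data: Seq[LabeledPoint]): M
          }

          // Predictor specializes Estimator for supervised prediction;
          // Classifier and Regressor specialize it further by label type.
          abstract class Predictor[M <: Model] extends Estimator[M]
          abstract class Classifier[M <: Model] extends Predictor[M] {
            def numClasses: Int
          }
          abstract class Regressor[M <: Model] extends Predictor[M]

          // A meta-algorithm like AdaBoost can then be written once over any
          // base Classifier, which is one of the goals listed above.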

          Items remaining:

          • [ ] helper method for simulating a distribution over weighted instances by subsampling (for algorithms which do not support instance weights)
          • [ ] several TODO items noted in the code
          • [ ] add tests and examples
          • [ ] general cleanup
          • [ ] make more of hierarchy private to ml
          • [ ] split into several smaller PRs

          General plan for splitting into multiple PRs, in order:
          1. Simple class hierarchy
          2. Evaluators
          3. IterativeEstimator
          4. AdaBoost
          5. NaiveBayes (Any time after Evaluators)

          Thanks to @epahomov and @BigCrunsh for input, including from https://github.com/apache/spark/pull/2137, which improves upon the org.apache.spark.mllib APIs.

          josephkb Joseph K. Bradley added a comment -

          APIs for Classifiers, Regressors

          apachespark Apache Spark added a comment -

          User 'jkbradley' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3427

          BigCrunsh Christoph Sawade added a comment -

          Okay. I will follow it.

          josephkb Joseph K. Bradley added a comment -

          Thanks for taking a close look!

          • Abstraction of Multilabel
            Things definitely get more complex with multiple labels, and the best way to handle them is not clear to me. I agree it would not make sense to have a whole bunch of types for the different combinations of multiple labels. Perhaps the abstraction should be MultilabelEstimator, which could predict any combination of categories and/or real values.
            • Note: It should not be a list of Estimators since proper multilabel prediction would do joint prediction, rather than predicting each label separately.
          • Model-based vs. memory-based
            Would these two concepts affect the public API? I don't think they would, but do you have an example of why there should be a shared abstract class?
            • For k-nearest-neighbors, I think the same Classifier and Classifier.Model abstraction would work. The Classifier would ideally compute some nice data structure for finding nearest neighbors, and the Model would store that data structure (or the original dataset for a very basic implementation).
          • Model vs. Estimator Abstraction
            I think you're bringing up an important point about public vs. developer interfaces. Here's what I mean:
            • Public interfaces: For most users, the functionality is the most important aspect. E.g., most users need to know they are using a Classifier, regardless of whether it is a DecisionTree or a GLM.
            • Developer (private[mllib]) interfaces: For developers, abstractions such as DecisionTree and GLM are very important.
            • Proposal: As part of the "Standardize MLlib interfaces," I hope to first clarify the public interfaces and decide what interfaces need to be exposed. As needed, we can work on improving the developer interfaces for specific groups of algorithms.
              • For this, the JIRA on clarifying GLM interfaces (https://issues.apache.org/jira/browse/SPARK-3251) seems like an important one, but it may be blocked by updates to the public MLlib API.

          Does that sound reasonable?

          With respect to traits vs. abstract classes, I agree it may be good to keep the lightweight public interfaces defined as traits as much as possible.
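
          (A tiny editorial sketch of that trait-based option, with illustrative names only:)

          // Illustrative only: public model contracts as lightweight traits, so a
          // concrete model can mix several in without fighting single inheritance.
          trait Model {
            def predict(features: Array[Double]): Double
          }
          trait ClassifierModel extends Model {
            def numClasses: Int
          }
          trait ProbabilisticClassifierModel extends ClassifierModel {
            def predictProbabilities(features: Array[Double]): Array[Double]
          }
          // A concrete LogisticRegressionModel could then extend an internal
          // linear-model base class and mix in ProbabilisticClassifierModel.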

          Almost done with initial prototype code, and will post that soon.

          BigCrunsh Christoph Sawade added a comment - edited

          Great initiative. I really appreciate the attempt to standardize and identify common interfaces. Currently, I have three issues:

          • Abstraction of Multilabel
            The distinction between classification and regression seems natural, and the multi-label abstraction also makes sense to me. The simplest multi-label approach that I can think of is a collection of binary classifiers. Do you also plan to support mixtures of multi-labels (regression / multinomial classification)? If so, does it make sense to distinguish between ``MultilabelClassifier`` and ``MultilabelRegressor``? Isn't it then just a list of Estimators?
          • Model-based vs. memory-based
            I am wondering whether it is worth distinguishing between memory-based (e.g., k-nearest neighbours, kernel machines, ...) and model-based predictions (decision trees, NN, Naive Bayes, GLMs). Or, more generally, how does k-nearest neighbours fit into that framework?
          • Model vs. Estimator Abstraction
            Currently, the main distinction is between classification and regression. However, many methods are similar because they share the same modelling rather than the same prediction type. I am wondering how these functional similarities can be reflected in the hierarchy. I tried to follow a bottom-up approach and applied these abstractions to different learning methods. Here are two examples:

          Decision trees are trained with some recursive algorithm such as ID3 or C4.5, and the prediction is obtained by traversing the tree. The difference between classification and regression plays a rather minor role. So, intuitively, there is a DecisionTree estimator that can be, e.g., ID3 or C4.5. Then, the DecisionTreeClassifier is a DecisionTree with classification criteria; it returns a DecisionTree.Model (the tree) with a predictClass function (Classifier.Model?). The DecisionTreeRegressor is a DecisionTree with regression criteria, and it returns a DecisionTree.Model with a predictScore function (Regressor.Model?). Formally, it looks like

          • DecisionTree extends Estimator
          • DecisionTreeClassifier extends DecisionTree with Classifier
          • DecisionTreeRegressor extends DecisionTree with Regressor
          • DecisionTree.Model extends Transformer
          • DecisionTreeClassifier.Model extends DecisionTree.Model with Classifier.Model
          • DecisionTreeRegressor.Model extends DecisionTree.Model with Regressor.Model

          Methods like LogReg, SVM, RidgeRegression, ... maintain a weight vector (one could probably summarize them as GLMs). The inner product with the example vector naturally yields a regression score for each prediction; a binary classification is then derived by thresholding that score. The underlying optimization problem for all of them consists of a sum over loss functions plus a regularization term (regularized empirical risk minimization; written out after the list below), which can be solved by different solvers, e.g., SGD, LBFGS... So to exploit this structure, I would expect something like this:

          • RegularizedEmpiricalRiskMinimizer extends Estimator
            // LogisticRegression and SupportVectorMachine could be an automatic selection between the binomial and multinomial version
          • BinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
          • MultinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
          • BinomialSupportVectorMachine extends RegularizedEmpiricalRiskMinimizer
          • RidgeRegression extends RegularizedEmpiricalRiskMinimizer
          • LinearModel extends Transformer
          • BinomialLinearModel extends LinearModel with Classifier.Model
          • MultinomialLinearModel extends LinearModel with Classifier.Model
          • BinomialLogisticRegression.Model extends BinomialLinearModel with ProbabilisticClassificationModel
          • MultinomialLogisticRegression.Model extends MultinomialLinearModel with ProbabilisticClassificationModel
          • BinomialSupportVectorMachine.Model extends BinomialLinearModel // actually it is a binomial linear model
          • RidgeRegression.Model extends LinearModel // actually it is a linear model
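
          For reference, the shared objective described above (regularized empirical risk minimization) can be written in standard notation as

              \min_w \; \frac{1}{n} \sum_{i=1}^{n} \ell(w; x_i, y_i) + \lambda R(w)

          where the choice of loss \ell selects the method (logistic loss for LogReg, hinge loss for SVM, squared loss for RidgeRegression), R is a regularizer such as the L2 penalty, and different solvers such as SGD or LBFGS optimize the same objective.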

          So isn't Classifier.Model more a trait than an abstract class? Perhaps I just missed something, but I think it is helpful to consider the interfaces for specific instances. I am really interested in discussing the pros/cons.

          josephkb Joseph K. Bradley added a comment -

          SPARK-3251 discusses a subset of the class hierarchy discussed here (for regression).

          josephkb Joseph K. Bradley added a comment -

          Both JIRAs discuss class hierarchy. This JIRA covers more classes. SPARK-3507 covers other issues such as testing.

          josephkb Joseph K. Bradley added a comment -

          The design doc is only partly written, and I am still writing up prototypes for Xiangrui Meng's github repo with prototypes of the new API. Feedback welcome, especially since this will hopefully cover a lot of learning settings!


            People

            • Assignee: josephkb Joseph K. Bradley
            • Reporter: josephkb Joseph K. Bradley
            • Votes: 4
            • Watchers: 21
