Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: MLlib
    • Labels: None
    • Target Version/s:

Description

      This part of the design doc is for pipelines and parameters. I put the design doc at

      https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

      I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/

      Please help review the design and post your comments here. Thanks!

Activity

          Sean Owen added a comment -

          A few high-level questions:

          Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the algorithms will come along, but in a fairly different form. I think that's actually a good thing. But is this targeted at a 2.x release, or sooner?

          How does this relate to MLI and MLbase? I had thought they would in theory handle things like grid-search, but haven't seen activity or mention of these in a while. Is this at all a merge of the two or is MLlib going to take over these concerns?

          I don't think you will need or want to use this code, but the oryx project already has an implementation of grid search on Spark. At least another take on the API for such a thing to consider. https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param

          Big +1 for parameter tuning. That belongs as a first-class citizen. I'm also intrigued by doing better than trying every possible combination of parameters separately, and maybe sharing partial results to speed up several models' training. Is this realistic for any parameters besides things like # iterations, which isn't really a hyperparam? I don't know of, for example, ways to build N models with N different overfitting params and share some work. I would love to know that's possible. Good to design for it anyway.

          I see mention of a Dataset abstraction, which I'm assuming contains some type information, like distinguishing categorical and numeric features. I think that's very good!

          I've always found the 'pipeline' part hard to build. It's tempting to construct a framework for feature extraction, and to some degree you can, by providing transformations, 1-hot encoding, etc. But a framework for understanding arbitrary databases, fields, and so on quickly grows to an endless scope. Spark Core is, to me, already the right abstraction for upstream ETL of data before it enters an ML framework. I mention it just because it's in the first picture, but I don't see discussion of actually doing user/product attribute selection later, so maybe it's not meant to be part of the proposal.

          I'd certainly like to keep up more with your work here. This is a big step forward in making MLlib more relevant to production deployments rather than just pure algorithm implementations.

          Xiangrui Meng added a comment -

          Sean Owen Thanks for the comments!

          The new set of APIs is targeted at 1.2. We can keep the `spark.mllib` package in 1.2 but mark some of its APIs as deprecated, depending on the progress of the migration.

          MLI and MLbase stay on the cutting edge of research. We will gradually merge the stable features developed there into MLlib.

          I will check Oryx's implementation and use it as a reference.

          There are multiple ways to optimize multi-model training. Besides sharing the number of iterations, we can compute multiple gradients in a single pass using level-3 BLAS (Burak Yavuz is working on this), or we can compute the full solution path for LASSO using LARS. A sketch of the level-3 BLAS idea follows.
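
          For illustration only, a minimal Breeze sketch of the level-3 BLAS idea for squared loss (the function name and shapes are assumptions, not the proposed API):

          import breeze.linalg.DenseMatrix

          // Gradients for k linear least-squares models computed in one pass.
          // x: n x d features; w: d x k with one weight column per model;
          // y: n x k labels (the same label vector replicated k times).
          // Two matrix-matrix products (GEMMs) replace k matrix-vector products.
          def multiModelGradients(x: DenseMatrix[Double],
                                  w: DenseMatrix[Double],
                                  y: DenseMatrix[Double]): DenseMatrix[Double] = {
            val residuals = x * w - y // n x k
            x.t * residuals           // d x k: column j is the gradient of model j
          }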

          The Dataset will be a subclass of SchemaRDD, or we can generalize SchemaRDD and use it as Dataset. The design is not finished yet, but the direction is to have something like a DataFrame.

          You can specify columns in a Dataset, so a feature selection algorithm could be applied to the user column and the product column separately. The user needs to specify the input column and the output column name in this case; the output column type is determined automatically by the estimator or transformer. A sketch of what this might look like follows.
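
          For illustration, a minimal sketch of configuring columns (FeatureSelector and the setter names are assumptions, not the final API):

          // Hypothetical transformer; setInputCol/setOutputCol names are assumed.
          val userSelector = new FeatureSelector()
            .setInputCol("user")          // read features from the "user" column
            .setOutputCol("userFeatures") // write selected features here

          val productSelector = new FeatureSelector()
            .setInputCol("product")
            .setOutputCol("productFeatures")

          // Each stage infers its output column type from its input column.
          val selected = productSelector.transform(userSelector.transform(dataset))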

          Great to hear that you are interested in the development! I will submit a PR to show a prototype later this week.

          Matei Zaharia added a comment -

          To comment on the versioning stuff here, "deprecated" doesn't mean unsupported, it just means we encourage using something else. So the old MLlib API will remain in 1.x, and will continue getting tested and bug-fixed, but it will not get new features.

          Eustache added a comment -

          Great to see the design docs!

          A few questions/remarks:

          • Big +1 for Pipeline and Dataset as first-class abstractions - as a long-time sklearn user, I find Pipelines a very convenient way to think about many problems, e.g. implementing cascades of models, integrating unsupervised steps for feature transformation into a supervised task, etc.
          • Isn't the "fit multiple models at once" part a bit of an early optimization? How many users would benefit from it? IMHO it complicates the API for most users.
          • I'm also wondering whether a meta class could handle fitting multiple models. AFAICT fitting multiple models at once resembles a parameter grid search, doesn't it? I assume the latter would return evaluation metrics for each parameter set as well as the model itself, right?
          • It seems to me that multi-task learning would be a good example of "multiple models at once" but is maybe not typical of what most users would want. Also, I'm not 100% sure the implementation would necessarily profit from such an API.

          Sandy Ryza added a comment -

          Isn't the "fit multiple models at once" part a bit of an early optimization?

          I personally think this is a useful feature in nearly all situations, and that parameter search is one of the most important problems for a machine learning framework to address. Avoiding premature optimization usually refers to getting bang for buck in terms of time spent. However, if this is something we think might eventually be useful, it's worth making API decisions that will accommodate it.

          Li Pu added a comment -

          Nice design doc! I have some experience with the parameter part. It would be great to have constraints on individual parameters and at the Params level. For example, learning_rate must be greater than 0, and regularization can be one of "l1", "l2". Parameter checking is something that every learning algorithm does, so some support at parameter-definition time would make code more concise.

          abstract class ParamConstraint[T] extends Serializable {
            def isValid(value: T): Boolean
            def invalidMessage(value: T): String
          }

          class IntRangeConstraint(min: Int, max: Int) extends ParamConstraint[Int] {
            def isValid(value: Int): Boolean = value >= min && value <= max
            def invalidMessage(value: Int): String = "..."
          }

          // constraints is a list because more than one constraint may apply to a Param
          class Param[T](..., constraints: List[ParamConstraint[T]] = List()) { ... }

          At definition time, we can write:

          val maxIter: Param[Int] = new Param(id, "maxIter", "max number of iterations", 100, List(new IntRangeConstraint(1, 500)))

          There shouldn't be too many types of constraints, so ml could provide a list of commonly used constraint classes. Keeping the parameter definition and its constraints on the same line also improves readability. The Params trait could use a similar structure to check constraints across multiple parameters, though that is less likely to be needed in real use cases. In the end, validateParams on Params would just call isValid on every member Param as the default implementation, as sketched below.
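
          A minimal sketch of that default implementation (ParamPair and the way values are stored are assumptions):

          // Pairs a Param with its current value so constraints can be
          // checked without losing the type parameter.
          case class ParamPair[T](param: Param[T], value: T) {
            def validate(): Unit =
              param.constraints.foreach { c =>
                require(c.isValid(value), c.invalidMessage(value))
              }
          }

          trait Params {
            // How (param, value) pairs are stored is left open in the design.
            protected def paramPairs: Seq[ParamPair[_]]

            // Default implementation: check every member Param's constraints.
            def validateParams(): Unit = paramPairs.foreach(_.validate())
          }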

          Xiangrui Meng added a comment - edited

          Eustache: The default implementation of multi-model training will be a for loop, but the API leaves space for future optimizations, like grouping weight vectors and using level-3 BLAS for better performance. It shouldn't be a meta class, because many optimizations are algorithm-specific. For example, LASSO can be solved via LARS, which computes a full solution path for all regularization parameters. The level-3 BLAS optimization is another example, which can give an 8x speedup (SPARK-1486).

          Li Pu: We can have a set of built-in preconditions, like positivity. Or we could accept a lambda function for assertions, (T) => Unit, which may be harder for Java users, but they should be familiar with creating those in Spark. A sketch follows.
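
          A minimal sketch of the lambda-based alternative (the check parameter is an assumption, not an agreed API):

          // Hypothetical constructor parameter `check`: an assertion run when
          // the parameter value is set or validated.
          class Param[T](val name: String, val doc: String, val default: T,
                         val check: T => Unit = (_: T) => ()) extends Serializable

          val maxIter = new Param[Int]("maxIter", "max number of iterations", 100,
            check = v => require(v >= 1 && v <= 500, s"maxIter out of range: $v"))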

          Egor Pakhomov added a comment -

          Nice doc.
          Parameter passing as part of grid search and pipeline creation is a great and important feature, but it's only one of the features. For me it's more important to see the Estimator abstraction in the Spark code base early: maybe not earlier than introducing the Dataset abstraction, but definitely earlier than any work on grid search.

          When we were thinking about creating such a pipeline framework, we came to the conclusion that transformations in the pipeline are like steps in an Oozie workflow: they should be easily retrievable, persisted, and queued. A transformation can take hours, and rerunning the whole pipeline after a step failure is expensive. A pipeline can also contain a grid search over parameters, which means many parallel executions that need careful scheduling. So I think a pipeline should be executed by some cluster-wide scheduler with persistence. I'm not saying that it's absolutely necessary now, but it would be great to have an architecture open to that possibility; a minimal version of the idea is sketched below.
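
          As a rough illustration of the persistence point only (the stage objects and paths are hypothetical, not a proposed API), each stage's output could be materialized so a failed downstream step resumes from the last saved result:

          // Hypothetical pipeline stages; the point is only that intermediate
          // results are saved, so a failure in stage2 does not rerun stage1.
          val stage1Out = stage1.transform(input)
          stage1Out.saveAsParquetFile("/pipelines/run42/stage1") // durable checkpoint

          // On retry, stage2 resumes from the persisted output of stage1.
          val stage2In = sqlContext.parquetFile("/pipelines/run42/stage1")
          val stage2Out = stage2.transform(stage2In)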

          Apache Spark added a comment -

          User 'mengxr' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3099

          Xiangrui Meng added a comment -

          Issue resolved by pull request 3099
          https://github.com/apache/spark/pull/3099

          Jao Rabary added a comment - edited

          Some questions after playing a little with the new ml.Pipeline.

          We mainly do large-scale computer vision tasks (image classification, retrieval, ...). The pipeline is a really great fit for that. We're trying to reproduce the tutorial given on that topic during the latest Spark Summit (http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html) using the master version of the Spark pipeline and dataframe. The tutorial shows different examples of feature extraction stages before running machine learning algorithms. Even though the tutorial is straightforward to reproduce with this new API, we still have some questions:

          • Can one use external tools (e.g. via pipe) as a pipeline stage? An example use case is extracting features learned with a convolutional neural network; in our case, this corresponds to a network pre-trained with the Caffe library (http://caffe.berkeleyvision.org/).
          • The second question is about the performance of the pipeline. A library such as Caffe processes data in batches, and instantiating a Caffe network can be time consuming when the network is very deep. So we can gain performance if we minimize the number of Caffe network creations and feed data to the network in batches. In the pipeline, this corresponds to running transformers that work on a per-partition basis and give the whole partition to a single Caffe network. How can we create such a transformer? A rough sketch of one possibility follows this list.
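
          A rough sketch of the per-partition idea (CaffeNet is a hypothetical JNI wrapper, not a real binding):

          import org.apache.spark.rdd.RDD

          // One network per partition: pay the expensive initialization once,
          // then feed the whole partition to the network as a single batch.
          def extractFeatures(images: RDD[Array[Byte]],
                              modelPath: String): RDD[Array[Float]] =
            images.mapPartitions { iter =>
              val net = CaffeNet.load(modelPath) // hypothetical JNI wrapper
              val batch = iter.toArray           // whole partition as one batch
              net.forward(batch).iterator        // assumed: one feature array per image
            }
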
          Evan Sparks added a comment -

          We have looked at integrating Caffe with Spark - probably the most logical place to do this (as a proof of concept) is via PySpark, since Caffe has Python bindings. If you're going to use a pre-trained model in Caffe, this should parallelize well and might well suit your needs. Training the pipeline in parallel is trickier because the communication requirements for training these types of networks via mini-batch SGD are very high.

          If you're in the scenario where you want to instantiate a pre-trained Caffe network on all the workers (and only do it once), then a broadcast variable is probably a good way to go, and I'd expect that it fits nicely into the pipelines framework.

          Jao Rabary added a comment -

          Yes, the scenario is to instantiate a pre-trained Caffe network. The problem with the broadcast is that I use a JNI binding of Caffe, and Spark isn't able to serialize the object.
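
          A common workaround for non-serializable JNI objects, sketched here with the same hypothetical CaffeNet wrapper, is to hold the network in a singleton object so each executor JVM initializes it locally instead of receiving it from the driver:

          // The object itself is never shipped; the lazy val is initialized on
          // first access inside each executor JVM.
          object CaffeNetHolder {
            lazy val net: CaffeNet = CaffeNet.load("/models/pretrained.caffemodel")
          }

          val features = images.mapPartitions { iter =>
            val net = CaffeNetHolder.net // local, per-JVM instance
            iter.map(img => net.forward(img))
          }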


People

    • Assignee: Xiangrui Meng
    • Reporter: Xiangrui Meng
    • Votes: 1
    • Watchers: 23
