Mahout
  MAHOUT-612

Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

    Details

      Description

      Most of the Mahout features require running several jobs in sequence. This can be done via the command line or using one of the driver classes.
      Running and configuring a Mahout job from Java requires either using the Driver's static methods or creating a String array of parameters and passing them to the job's main method. If we could instead configure jobs through a Java bean or factory, configuration would be type safe and easier to use with DI frameworks such as Spring and Guice.

      I have added a patch in which I factored out a KMeans MapReduce job plus a configuration Java bean from KMeansDriver.buildClustersMR(...).

      • The KMeansMapReduceConfiguration takes care of setting up the correct values in the Hadoop Configuration object and initializes defaults. I copied the config keys from KMeansConfigKeys.
      • The KMeansMapReduceJob contains the code for the actual algorithm running all iterations of KMeans and returns the KMeansMapReduceConfiguration, which contains the cluster path for the final iteration.
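      As a sketch of what such a configuration bean might look like (the class name, fields, and defaults below are illustrative assumptions, not necessarily the patch's actual API):

```java
// Hypothetical configuration bean for a KMeans MapReduce job.
// Field names and defaults are illustrative, not Mahout's actual API.
public class KMeansJobConfiguration {

    private String inputPath;
    private String outputPath;
    private int maxIterations = 10;
    private double convergenceDelta = 0.5;

    public String getInputPath() { return inputPath; }
    public void setInputPath(String inputPath) { this.inputPath = inputPath; }

    public String getOutputPath() { return outputPath; }
    public void setOutputPath(String outputPath) { this.outputPath = outputPath; }

    public int getMaxIterations() { return maxIterations; }
    public void setMaxIterations(int maxIterations) { this.maxIterations = maxIterations; }

    public double getConvergenceDelta() { return convergenceDelta; }
    public void setConvergenceDelta(double convergenceDelta) { this.convergenceDelta = convergenceDelta; }
}
```

      Because the settings are typed bean properties rather than String array entries, a mistyped option becomes a compile error instead of a runtime failure, and a DI container can populate the bean directly.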

      I would like to extend this approach to other Hadoop jobs, for instance the job for creating points in KMeansDriver, but first I want some feedback on this.

      One of the benefits of this approach is that it becomes easier to chain jobs. For instance we can chain Canopy to KMeans by connecting the output dir of Canopy's configuration to the input dir of the configuration of the KMeans job next in the chain. Hadoop's JobControl class can then be used to connect and execute the entire chain.
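      The chaining idea can be sketched with plain config beans, where wiring is just a method call; the resulting dependency order is what would then be handed to Hadoop's JobControl. All class and field names below are illustrative assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of chaining jobs by wiring config beans together.
// Class and field names are illustrative, not Mahout's actual API.
public class JobChain {

    static class JobConfig {
        final String name;
        String inputPath;
        String outputPath;

        JobConfig(String name, String inputPath, String outputPath) {
            this.name = name;
            this.inputPath = inputPath;
            this.outputPath = outputPath;
        }
    }

    // Connects each job's input to the previous job's output and records
    // the dependency order, mirroring what Hadoop's JobControl enforces.
    static Map<String, String> chain(JobConfig... jobs) {
        Map<String, String> dependsOn = new LinkedHashMap<>();
        for (int i = 1; i < jobs.length; i++) {
            jobs[i].inputPath = jobs[i - 1].outputPath;
            dependsOn.put(jobs[i].name, jobs[i - 1].name);
        }
        return dependsOn;
    }
}
```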

      This approach can be further improved by turning the configuration bean into a factory for creating MapReduce or sequential jobs. This would probably remove some duplicated code in the KMeansDriver.

      1. MAHOUT-612.patch
        22 kB
        Frank Scholten
      2. MAHOUT-612-canopy.patch
        50 kB
        Frank Scholten
      3. MAHOUT-612-kmeans.patch
        73 kB
        Frank Scholten
      4. MAHOUT-612-v2.patch
        46 kB
        Frank Scholten

        Activity

        Grant Ingersoll added a comment -

        It seems like we shouldn't have to wait for the whole thing to be done on this. Forward progress towards where we want to go is better than no progress.

        Frank Scholten added a comment -

        Putting this in backlog for now. As much as I like the idea of this patch, improving parts of the clustering code I work with regularly has a higher priority for me. So far the kmeans, canopy and seq2sparse jobs have been refactored to have a bean configuration. If you want to help with this, check out the github repo at https://github.com/frankscholten/mahout/tree/MAHOUT-612-0.5

        Sean Owen added a comment -

        Looking good, marking for 0.6

        Frank Scholten added a comment -

        Pushed seq2sparse configuration to Github http://bit.ly/rmWAf4

        Frank Scholten added a comment -

        Just pushed a new branch to Github, https://github.com/frankscholten/mahout/tree/MAHOUT-612-0.5, rebased at 0.5 with one commit of both KMeans and Canopy config. Next up is SparseVectorsFromSequenceFiles.

        Frank Scholten added a comment -

        Cool that you're interested! I recently rebased my changes locally on a Mahout 0.5 branch and I'm making the Canopy configuration consistent with the KMeans configuration, wrt serialization and coding style. This is taking some time as I'm fixing a bunch of Canopy unit tests. I will push this to Github soon.

        After this I think it's important that SparseVectorsFromSequenceFiles is refactored since it's almost always needed for clustering jobs.

        Ian Helmke added a comment -

        Frank, are you still making changes here? Benson and I are looking to continue/complete the beanification of these jobs. Just wondering if you'd made any progress on it.

        Frank Scholten added a comment -

        Still at KMeans and Canopy. After Berlin Buzzwords I'll have time to continue with this issue.

        Sean Owen added a comment -

        Frank, how far along are you here? It would be great to commit this once you've hit all the jobs you intend to.

        Frank Scholten added a comment -

        Latest version of K-Means driver refactoring in sync with trunk

        Sean Owen added a comment -

        I think one big patch is preferable. It avoids the risk that the patch can't be completed for some reason. It should be about as much work, including merge conflicts. But I don't think it is a big deal if you'd like to do it piece by piece too.

        Frank Scholten added a comment -

        These patches are outdated. This will indeed be an ongoing piece of work and won't be done by 0.5.

        What's the thinking on how to include the work? Robin said: "But, before committing I will wait for the full change to all Jobs so that code is not in un-even state."

        I would prefer to submit a patch per job configuration: one for K-means, one for Canopy, and so on. The code will be in an un-even state, true, but this will prevent a lot of effort merging changes into trunk later on, considering how actively Mahout is being developed.

        Sean Owen added a comment -

        May I start submitting the patches? Are the "v2" and "canopy" patches ready to go?
        What's the thinking on whether this will be considered done by 0.5 in a few weeks, or should it be an ongoing piece of work for the next release?

        Frank Scholten added a comment - edited

        Yes, good idea. How about

        interface SerializableConfiguration<T> {
         
          T getFromConfiguration(Configuration configuration);
        
          Configuration serializeInConfiguration(T t);
        
        }
        

        that will be implemented by KMeansConfiguration, CanopyConfiguration and configuration classes yet to be created.

        KMeansConfiguration's equals method is used in KMeansConfigurationTest via assertEquals.
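        A possible implementation of the proposed round-trip contract, using a plain Map<String, String> as a stand-in for Hadoop's Configuration so the idea is easy to test in isolation (the key names and the Map stand-in are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the serialize/deserialize round trip. A Map stands in for
// org.apache.hadoop.conf.Configuration; key names are illustrative.
public class KMeansConfigSketch {

    static final String KEY_CLUSTER_PATH = "org.apache.mahout.clustering.kmeans.clusterPath";
    static final String KEY_CONVERGENCE = "org.apache.mahout.clustering.kmeans.convergence";

    String clusterPath;
    double convergenceDelta;

    // Writes all bean fields into the configuration in one place.
    Map<String, String> serializeInConfiguration() {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY_CLUSTER_PATH, clusterPath);
        conf.put(KEY_CONVERGENCE, Double.toString(convergenceDelta));
        return conf;
    }

    // Rebuilds the bean from the configuration, e.g. in a mapper's setup.
    static KMeansConfigSketch getFromConfiguration(Map<String, String> conf) {
        KMeansConfigSketch c = new KMeansConfigSketch();
        c.clusterPath = conf.get(KEY_CLUSTER_PATH);
        c.convergenceDelta = Double.parseDouble(conf.get(KEY_CONVERGENCE));
        return c;
    }
}
```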

        Robin Anil added a comment -

        Sorry about the late reply

        The serialize and deserialize methods can be made stricter using an interface, and maybe renamed to make them explicit.

        Maybe an interface which has the "SerializeInConfiguration() GetFromConfiguration()" method, to make things strictly uniform.

        It looks good otherwise.

        A small nit: autogenerated equals() and hashCode() are OK, but do you see them being used? As keys in HashMaps? You can choose to ignore them if you wish (or throw an exception). IMO a config object's primary purpose is to serialize or deserialize its members.

        Robin

        Frank Scholten added a comment -

        I added K-Means config serialization code at https://github.com/frankscholten/mahout/tree/MAHOUT-612 - see the 'kmeans-serialization' tag.

        Robin: Is this close to what you had in mind?

        Frank Scholten added a comment -

        I started a MAHOUT-612 branch at https://github.com/frankscholten/mahout/tree/MAHOUT-612 and added the K-Means v2 and Canopy patches.

        Robin: Ok, I'll look into the serialization issue for K-Means and Canopy next.

        Ted Dunning added a comment -

        Isabel,

        I find that keeping large patches up to date with only an SVN branch is infeasible. Thus, I opt to use git privately and use the SVN interface to push changes back to SVN when committing.

        Once I am doing that, why not share my git repository so that others can comment on the work in progress? I still will have to be careful about what code I incorporate, but that is the responsibility of a committer in any case.

        Hopefully, this should become irrelevant soon since Apache is making rapid progress on supporting git.

        Isabel Drost-Fromm added a comment -

        Robin: Putting my Apache Hat on - I know how easy github makes collaboration, however it would be nice to keep development inside of our project, so until Apache supports for git r/w access, I was only wondering whether an svn branch would provide any benefit ...

        Robin Anil added a comment -

        Isabel: Yeah github is the easiest way to go.

        Robin Anil added a comment -

        See FPGrowthParameters. It does something similar. I do not think that will have any performance effect; we are talking about < 10 KB of data here.

        Frank Scholten added a comment -

        Robin: Maybe I understand what you mean about serializing the config. At the moment the mappers and reducers still need to access values in the Configuration object via the config keys. Is it possible to turn the (KMeans|Canopy)Configuration into a simple POJO, have it implement Writable, serialize it inside the Configuration, and deserialize it in the mapper and reducer? Or does this have performance implications or other consequences?

        We could maybe add a method to (KMeans|Canopy)Configuration

        public Configuration asConfiguration() { ... }

        where it serializes itself inside a Configuration and then returns it.
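        The Writable idea above can be sketched with plain java.io streams standing in for the DataOutput/DataInput that org.apache.hadoop.io.Writable actually uses; the class and field names are illustrative assumptions:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of a Writable-style round trip for a config POJO.
// Field names are illustrative; a real implementation would
// implement org.apache.hadoop.io.Writable.
public class KMeansConfigWritableSketch {

    String clusterPath;
    int maxIterations;

    // Mirrors Writable.write(DataOutput): serialize all fields in order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(clusterPath);
        out.writeInt(maxIterations);
    }

    // Mirrors Writable.readFields(DataInput): read fields in the same order.
    void readFields(DataInput in) throws IOException {
        clusterPath = in.readUTF();
        maxIterations = in.readInt();
    }
}
```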

        Frank Scholten added a comment -

        Added patch for the canopy jobs.

        This time I also moved the logic of composing output paths (canopies and points) from the CanopyDriver into the configuration object.

        The config keys are now moved to CanopyConfiguration and CanopyConfigKeys is removed. Some keys are different because I renamed output to outputBasePath to make it clear the canopy and points outputs are relative paths under this outputBasePath.
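        The path-composition idea, sketched with illustrative names (outputBasePath and the sub-directory names below are assumptions, not necessarily the patch's actual values):

```java
// Hypothetical sketch: deriving the canopy and points output paths
// from a single output base path. Names are illustrative.
public class CanopyPaths {

    private final String outputBasePath;

    public CanopyPaths(String outputBasePath) {
        this.outputBasePath = outputBasePath;
    }

    // Both outputs are relative paths under the base path,
    // so callers only configure one location.
    public String getCanopyOutputPath() {
        return outputBasePath + "/canopies";
    }

    public String getPointsOutputPath() {
        return outputBasePath + "/clusteredPoints";
    }
}
```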

        Isabel Drost-Fromm added a comment -

        Robin, I agree with your concern about avoiding an un-even state in the code base. Given the anticipated amount of work that has to go into this, would it make sense to track these changes in a separate branch to avoid the "one huge patch that touches everything at once" problem?

        Frank Scholten added a comment -

        Ok, I'll tackle the canopy jobs next.

        What exactly do you mean by serializing and deserializing the configuration object at once in the job/mapper?

        Robin Anil added a comment -

        Indeed! It removes a whole lot of parameters from the function. The code looks a lot nicer now. I would like to have the configuration object serialized and deserialized at once in the job/mapper, or merged with the configuration object in some generic way, maybe via a base MahoutConfigBase class. All these are nice-to-haves. If you can, I would really appreciate such a change.

        But, before committing I will wait for the full change to all Jobs so that code is not in un-even state.

        Sean Owen added a comment -

        This looks like quite a positive change, at the macro and micro level. Robin any thoughts? I can commit in a short while otherwise.

        Frank Scholten added a comment - edited

        Updated and expanded the patch. Renamed KMeansMapReduceJob to KMeansMapReduceAlgorithm and added KMeansSequentialAlgorithm.

        These implementations also create the points mapping by default, based on the runClustering flag.

        The KMeansConfiguration can be used for both of these implementations.

        Sean Owen added a comment -

        I think I understand your patch. You're leaving KMeansDriver as the shell with which to run it from the command line, but introducing one more layer of abstraction between it and running Hadoop MapReduces so that it can be invoked programmatically. Sounds fine to me.

        My only bit of feedback then is about naming. We unfortunately have some conflicting naming here for the command-line class that runs MapReduces and implements Tool. It's a "*Job" in some places and "*Driver" in other places. (Anyone prefer one convention? I could JIRA that too.)

        To avoid deepening the confusion, consider renaming KMeansMapReduceJob to something that doesn't end in either of those.


  People

  • Assignee: Unassigned
  • Reporter: Frank Scholten
  • Votes: 0
  • Watchers: 3
