Details

      Description

      Update: This issue remains open to track the effort to standardize, at minimum, on a single version of Hadoop's API. At present, that means using the .mapreduce.* classes instead of the .mapred.* classes.
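
      For orientation, here is a minimal sketch of a driver written against the target .mapreduce.* API on Hadoop 0.20.x. It is not taken from this issue's patches; the class name and identity job are illustrative only:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        // Illustrative identity job: org.apache.hadoop.mapreduce.Job replaces the
        // deprecated org.apache.hadoop.mapred.JobConf/JobClient pair.
        public class NewApiIdentityJob {
          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "new-api-identity");
            job.setJarByClass(NewApiIdentityJob.class);
            job.setMapperClass(Mapper.class);       // new-API identity mapper
            job.setReducerClass(Reducer.class);     // new-API identity reducer
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // waitForCompletion() replaces the old JobClient.runJob(jobConf) call
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }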

      Attachments

      1. MAHOUT-294a.patch (31 kB) - Jeff Eastman
      2. MAHOUT-294.patch (118 kB) - Jeff Eastman
      3. MAHOUT-294.patch (106 kB) - Jeff Eastman

          Activity

          Jake Mannix added a comment -

          Have you checked out my patch on MAHOUT-301? It's related, but not exactly the same: it's the external API, which will be even cleaner once this issue gets some progress.

          Jeff Eastman added a comment - edited

          I've cleaned up the clustering driver command processing by refactoring all of the options into DefaultOptionCreator and standardizing option names, descriptions, required/optional settings and default values. I also updated all of the clustering .props files for MAHOUT-301 to include sections on required and optional parameters and their default values. All .props files contain commented-out entries that users can activate by editing the files.
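
          A hedged sketch of the shared-option pattern this describes, using the org.apache.commons.cli2 builders Mahout relied on at the time. The convergence option shown, and its default value, are illustrative rather than taken from the patch:

            // Illustrative factory method in the style of DefaultOptionCreator,
            // using org.apache.commons.cli2.builder classes.
            public static DefaultOptionBuilder convergenceOption() {
              return new DefaultOptionBuilder()
                  .withLongName("convergenceDelta")       // standardized option name
                  .withShortName("cd")
                  .withRequired(false)                    // optional...
                  .withArgument(new ArgumentBuilder()
                      .withName("convergenceDelta")
                      .withDefault("0.5")                 // ...with a default (hypothetical)
                      .withMinimum(1).withMaximum(1)
                      .create())
                  .withDescription("The convergence delta value. Default is 0.5");
            }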

          Jeff Eastman added a comment -

          Removing clustering from this issue, as all drivers now have consistent options.

          Robin Anil added a comment -

          Adding clustering back; I saw some bugs:

          • KMeans made the -k parameter required=true, so it was overwriting the centroids even when -k was not specified, instead of reading them
          • LDA: the topic smoothing default was changed to -1 (it should be 50/numTopics)
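
          For the LDA default, a hedged one-liner of the intended fallback (the variable names are hypothetical):

            // If no positive smoothing value was supplied, fall back to the
            // 50/numTopics heuristic (cliValue and numTopics are illustrative names).
            double topicSmoothing = cliValue > 0.0 ? cliValue : 50.0 / numTopics;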

          Jeff Eastman added a comment -

          The bug reported above has been fixed (topicSmoothing was doing the right thing in the driver), and I've been looking at reparenting the clustering drivers to AbstractJob. The primary benefits I can see are the consolidation of the input and output command options, and the parseArguments() method, which is useful for most - but not all - map/reduce job configuration steps. The class requires an instance run(String[]) method (Tool actually requires it), and this will require some refactoring of the clustering drivers, since they all have a static runJob() method that accepts Java arguments and is called by main(). Also, using addOption() will require backing out recent changes in DefaultOptionCreator, or at least refactoring that too.

          It is interesting that the Tool interface's Javadoc example uses both Configuration (Hadoop 0.20.2) and JobConf (deprecated).

          Anyway, I'm not sure this warrants all the effort. I'm game but need a little more encouragement/benefits.
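
          For concreteness, a rough sketch of the refactoring shape being discussed. The driver and its job() body are hypothetical; addInputOption(), addOutputOption(), parseArguments() and the path getters are assumed AbstractJob helpers:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.util.ToolRunner;
            import org.apache.mahout.common.AbstractJob;

            // Hypothetical driver showing the static-runJob() -> instance-run()/job() split.
            public class ExampleClusteringDriver extends AbstractJob {

              public static void main(String[] args) throws Exception {
                ToolRunner.run(new Configuration(), new ExampleClusteringDriver(), args);
              }

              @Override
              public int run(String[] args) throws Exception {
                // Tool requires an instance run(String[]): parse the options here...
                addInputOption();
                addOutputOption();
                if (parseArguments(args) == null) {
                  return -1;   // help was requested or parsing failed
                }
                // ...then delegate the heavy lifting to an instance job() method,
                // which the existing static runJob() can also call with Java arguments.
                job(getInputPath(), getOutputPath());
                return 0;
              }

              private void job(Path input, Path output) {
                // configure and submit the map/reduce steps (omitted in this sketch)
              }
            }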

          Ted Dunning added a comment -

          Anyway, I'm not sure this warrants all the effort. I'm game but need a little more encouragement/benefits.

          Here is some generic encouragement.

          Right now, the only integration strategy we have for the machine learning side of the house is command line based. Getting that to be consistent is really important for user understanding. As such, what you are doing is almost more helpful than any single new algorithm.

          (does that help?)

          Drew Farris added a comment -

          Anyway, I'm not sure this warrants all the effort. I'm game but need a little more encouragement/benefits

          Here's more. Also, more use cases for AbstractJob will help us work out the nits. When I was last working on it, I noticed that there was some friction between DefaultOptionCreator and addOption(); I'm very much interested to hear your thoughts on the subject, Jeff.

          Jeff Eastman added a comment - edited

          I've seen the light, and attached is a patch that goes most of the way towards being a good child of AbstractJob. It refactors the command parsing out of main() into run(), and also refactors the guts of the static runJob() into a non-static job() which does all the heavy lifting. job() is also called from run(), so the whole thing hangs together pretty well, and none of the runJob() clients were impacted.

          On DefaultOptionCreator, I did some constant extraction and inlined any methods that were called by only one driver. The remaining options are shared, and I used addOption(DefaultOptionCreator.blahOption().create()) rather than exploding them into the expanded addOption(...). I found a precedent for this in CollocDriver, and it seemed like a lot of busy work to refactor all the usages of DefaultOptionCreator, so I did not do that.

          I also did not use AbstractJob.prepareJob(), but left the individual conf and job initializations in place, since they are working. There was also precedent for this. Clustering has several job steps (e.g. runClustering) which do not use reducers, and prepareJob() doesn't work for them.

          In terms of feedback on the AbstractJob design, I find the need to create two constants unwieldy, as in:

            private static final String NUM_CLUSTERS_OPTION = "numClusters";
            public static final String NUM_CLUSTERS_OPTION_KEY = "--" + NUM_CLUSTERS_OPTION;
          

          since the options argument map is keyed by "--" prepended to the option's long name. My preference would be to omit that, but I see the code is using Option.preferredName(), which is not subject to modification.
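
          A small illustration of how the second constant gets used (a fragment; the map type follows the parseArguments() discussion above, and the lookup is illustrative):

            // parseArguments() yields a map keyed by "--" + the option's long name,
            // so lookups need NUM_CLUSTERS_OPTION_KEY rather than the bare name.
            Map<String, String> parsedArgs = parseArguments(args);
            String k = parsedArgs.get(NUM_CLUSTERS_OPTION_KEY);   // key is "--numClusters"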

          All the unit tests run, but they don't really test the command-line processing; the clustering tests use runJob() with Java arguments instead. I admit to being a bit on the lazy side in terms of some possible refactoring, but I think this is pretty close to the target.

          Jeff Eastman added a comment -

          Revised patch that fixes some problems uncovered by the included unit tests, which exercise some of the run() parsing functionality. The examples still need work to conform.

          Jeff Eastman added a comment -

          r964507 committed a major refactoring of the clustering application drivers and jobs to extend AbstractJob. Use of DefaultOptionCreator has standardized all shared command-line arguments, and revised unit tests exercise the driver run() method for each application. I'm going to go out on a limb and declare that clustering is now consistent with this issue, and am dropping it from the affected components list.

          Hudson added a comment -

          Integrated in Mahout-Quality #137 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/137/)
          MAHOUT-294:

          • Refactored clustering jobs to subclass AbstractJob
          • Added fuzzy k-means example to synthetic control examples
          • Added cluster dump to synthetic control examples
          • Fixed _log file access bug in ClusterDumper when run on Hadoop
          • All synthetic control examples run on a Hadoop cluster
          • Fuzzy k-means produces numerically odd-looking clusters
          • Added unit tests of run() command-line processing for each clustering algorithm
          • All unit tests run

          Hudson added a comment -

          Integrated in Mahout-Quality #138 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/138/)
          MAHOUT-294:

          • Fixed a serialization bug in FuzzyKMeansInfo that was not serializing the combinerState and led to inflated cluster centroid values
          • Removed a combiner TODO, which indicates the problem has been in trunk for some time
          • Added a unit test, testFuzzyKMeansInfoSerialization()
          • TestClusterDumper and SyntheticControl now produce reasonable-looking fuzzy k-means results
          • All unit tests run

          Hudson added a comment -

          Integrated in Mahout-Quality #140 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/140/)
          MAHOUT-294: Another bit of polishing to replace inlined option-building sequences with multi-parameter addOption() calls. All tests run.

          Jeff Eastman added a comment -

          Sigh. Adding clustering back as an affected component. This is a large issue across all of Mahout, and further refactoring is needed in many areas. Testability of the command-line option processing is complicated by the fact that run() bundles it in with running the job, so the command-line handling cannot be tested in isolation. This makes testing all of the argument corner cases tedious and unnecessarily time-consuming. I'm going to look at factoring the option parsing out of run().

          Drew Farris added a comment -

          Testability of the command line option processing is complicated by the fact that run() bundles this in with running the job so the command line stuff cannot be tested in isolation. This makes testing all of the argument corner-cases tedious and unnecessarily time-consuming. I'm going to look at factoring the options parsing out of run.

          This will most definitely be a good thing to do. I've always wondered if more of the option processing can be pushed up to AbstractJob; once the arguments hash is obtained, it still seems that there is a lot of work to be done prior to launching each job.

          Jeff Eastman added a comment -

          Here's a stab at improving the testability of AbstractJob options parsing. It adds an argMap variable to AbstractJob, and adds new getOption() and hasOption() methods which encapsulate the "--" prepending, avoiding additional constants. Factoring out ClusterDumper.addOptions() as a public method allows unit testing of the command-line processing without invoking the cluster dumper. We could require this in all subclasses by adding AbstractJob.run() and calling a new abstract addOptions() from it; that would have broad impact on all drivers, and I have not done it in this patch.

          As a further step, one could imagine moving all of the common options from DefaultOptionCreator into AbstractJob. This would put all of the Mahout shared command-line options in a single place, improving consistency.

          Comments on this approach are welcome. I'm gone for the weekend.
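
          A minimal sketch of the helpers being described, assuming the argMap above (exact signatures in the committed code may differ):

            // Sketch of AbstractJob fragments; optionKey() encapsulates the "--" prepending.
            protected Map<String, String> argMap;

            protected static String optionKey(String optionName) {
              return "--" + optionName;
            }

            protected boolean hasOption(String optionName) {
              return argMap.containsKey(optionKey(optionName));
            }

            protected String getOption(String optionName) {
              return argMap.get(optionKey(optionName));
            }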

          Drew Farris added a comment -

          The latest patch looks good to me. I updated CollocDriver to use the getOption()/hasOption() methods, and it is nice that the same constants used to define an option can also be used to look it up.

          As far as DefaultOptionCreator is concerned, I agree it appears that it could be merged in, but I'm uncertain where it's used outside of the latest AbstractJob work.

          Hudson added a comment -

          Integrated in Mahout-Quality #144 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/144/)
          MAHOUT-294:

          • Added an argMap field to AbstractJob to allow abstraction of option access
          • Added hasOption(), getOption() and optionKey() to AbstractJob to encapsulate all "--" prepending
          • Revised all clustering components to use hasOption()/getOption()
          • Added MahoutTestCase.optKey(), which uses optionKey() to encapsulate all "--" prepending
          • Decided not to continue with the public addOptions() approach from the earlier patch
          • Removed all public OPTION_KEY constants in DefaultOptionCreator but made the respective OPTION constants public
          • Revised all clustering tests to use optKey()
          • Made interClusterDensity(), intraClusterDensity() and separation() public in CDbwEvaluator
          • All unit tests run

          Hudson added a comment -

          Integrated in Mahout-Quality #151 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/151/)

          Hudson added a comment -

          Integrated in Mahout-Quality #152 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/152/)
          MAHOUT-294:

          • Added a method=sequential|mapreduce argument to MeanShift, k-Means and FuzzyK so that the reference implementations can be exercised from the command line (a sketch of the dispatch follows below)
          • Added/updated unit tests to verify the sequential implementations from the command line
          • All tests run

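          A hedged sketch of what such a dispatch could look like inside a driver's run() (the option value checks and the runSequential()/runMapReduce() methods are illustrative, not the committed names):

            // Branch on the method option; runSequential()/runMapReduce() are
            // hypothetical stand-ins for the driver's two execution paths.
            String method = getOption("method");
            if ("sequential".equals(method)) {
              runSequential(input, output);   // in-process reference implementation
            } else {
              runMapReduce(input, output);    // launch the Hadoop job(s)
            }
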
          Hudson added a comment -

          Integrated in Mahout-Quality #155 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/155/)
          MAHOUT-294:

          • Modified most DisplayClustering subclasses to use the new sequential method on drivers (not Dirichlet yet)
          • Using the file system to transmit clusters required a rework, since they were not serializing needed state
          • Refactored Canopy, Cluster, SoftCluster and MeanShiftCanopy significantly, abstracting shared behavior into a new AbstractCluster class
          • Deleted ClusterBase after moving its static method to AbstractCluster
          • Added ClusterObservations to replace KMeansInfo and FuzzyKMeansInfo
          • Changed all cluster identifier formatting to include a type indication
          • The upshot of the new clusters is improved posterior statistics for all, with radius() now returning stdDev(), a vector
          • The new radius() is used in the Display examples to show elliptical clusters
          • Adjusted unit tests; all pass

          I probably should have made a new JIRA for some of this.

          Hudson added a comment -

          Integrated in Mahout-Quality #274 (See https://hudson.apache.org/hudson/job/Mahout-Quality/274/)
          MAHOUT-294: Factored a run(...) method out of EigenVerificationJob.run(String[]) so that it can be called from TestClusterDumper. Made private a few of its other methods that were never used outside the class. All tests run.

          Hudson added a comment -

          Integrated in Mahout-Quality #276 (See https://hudson.apache.org/hudson/job/Mahout-Quality/276/)
          MAHOUT-294: Added arguments to the DistributedLanczosSolver CLI so that
          it can optionally launch the EigenVerificationJob. Added a new run() method with
          additional arguments needed by EVJ. Added new unit tests of the CLI and adjusted
          some of the cluster dumper tests to use the new run method. All tests run.

          NOTE: this patch changes the semantics of the DLS --output argument. Formerly
          it was the path to the output (raw) eigenvectors file; now it is a path to an output
          directory in which a rawEigenvectors file will be written. If the EVJ is also specified,
          the output directory will additionally contain a cleanEigenvectors file.

          Sean Owen added a comment -

          There is plenty of progress here, though much code still uses the old Hadoop APIs and specialized parameters. There is enough work left that this must be retargeted for 0.5.

          Sean Owen added a comment -

          I think this over-arching issue of getting off the old Hadoop APIs remains important, but it must be kicked down the road a bit. It should block 1.0, however. We do need to understand how the existing old code will finally be updated.

          Sean Owen added a comment -

          I kinda give up on this. I don't see efforts to clean out the remaining usages of the old Hadoop APIs. It is probably just how it is going to be for that code until it's deleted.


            People

            • Assignee: Sean Owen
            • Reporter: Robin Anil