Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1
    • Component/s: None
    • Labels: None

      Description

      A utility package used to:

      • configure classes
      • create default configuration files
      • parse main method arguments
      • produce human-readable help
      • provide getters and setters for each value as both an object and a string, for a future generic, reflection-based GUI.
      1. MAHOUT-49.txt
        51 kB
        Karl Wettin
      2. MAHOUT-49.txt
        25 kB
        Karl Wettin

        Activity

        Karl Wettin added a comment - - edited

        Something like this:

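        import java.util.Enumeration;

        public interface ParameterEnumerable {
        
          /** @return the number of parameters exposed by this object */
          public abstract int numberOfParameters();
          /** @return an enumeration of all parameters exposed by this object */
          public abstract Enumeration<Parameter<?>> enumerateParameters();
        
          public interface Parameter<T> {
            /** @return the name of this parameter */
            public abstract String name();
            /** @return a human-readable description of this parameter */
            public abstract String description();
            /** @return the class of the value held by this parameter */
            public abstract Class<T> type();
            /** @param stringValue the new value, in its string form */
            public abstract void setStringValue(String stringValue);
            /** @return the current value, in its string form */
            public abstract String getStringValue();
            /** @param value the new value, as an object */
            public abstract void setValue(T value);
            /** @return the current value, as an object */
            public abstract T getValue();
            /** @return the default value of this parameter */
            public abstract T defaultValue();
        
          }
        }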
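
A minimal sketch of how one concrete parameter might implement the interface above, showing the string view and the object view of the same value. The constructor shape and the field layout are assumptions made for illustration, and the interface is repeated in trimmed form only so that the example compiles on its own.

```java
class ParameterSketch {

  /** Trimmed copy of the proposed interface, repeated so the example is self-contained. */
  interface Parameter<T> {
    String name();
    String description();
    Class<T> type();
    void setStringValue(String stringValue);
    String getStringValue();
    void setValue(T value);
    T getValue();
    T defaultValue();
  }

  /** Hypothetical Double-valued parameter; the constructor shape is an assumption. */
  static final class DoubleParameter implements Parameter<Double> {
    private final String name;
    private final String description;
    private final Double defaultValue;
    private Double value;

    DoubleParameter(String name, String description, Double defaultValue) {
      this.name = name;
      this.description = description;
      this.defaultValue = defaultValue;
      this.value = defaultValue; // start out at the default
    }

    public String name() { return name; }
    public String description() { return description; }
    public Class<Double> type() { return Double.class; }

    // String view: what a configuration file or a GUI text field would use.
    public void setStringValue(String stringValue) { value = Double.valueOf(stringValue); }
    public String getStringValue() { return value == null ? null : value.toString(); }

    // Object view: what the job itself would use.
    public void setValue(Double value) { this.value = value; }
    public Double getValue() { return value; }
    public Double defaultValue() { return defaultValue; }
  }

  public static void main(String[] args) {
    DoubleParameter threshold = new DoubleParameter("threshold", "Example knob.", 0.2d);
    threshold.setStringValue("0.5");
    System.out.println(threshold.name() + " = " + threshold.getValue());
  }
}
```

Because every parameter can be read and written as a string, a generic front end would not need to know the concrete value type.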
        
        Sean Owen added a comment -

        Is this anything to do with parsing command line and providing help on options? Sounds like something that Commons CLI handles – or if you're suggesting something merely similar, maybe still a package that can be reused partially.

        Karl Wettin added a comment -

        Is this anything to do with parsing command line and providing help on options?

        Not so much the command line. That's usually just a path to where all the files are located.

        This is more about producing help on the parameters one can pass to the Hadoop job in the configuration file: what the training data in the path is called, what level to set some knob to, what class strategy to use for something, the settings for that strategy, etc.

        I also want it to handle composite configuration. If I set a distance measure class, it should iterate over the new settings of that too.

        With getters and setters in the parameters, it's fairly simple to create a GUI for setting up jobs. Weka does similar things in its Explorer GUI.
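
The composite case can be sketched as a small tree walk: a setting may own nested settings, and the full key of a nested setting is its parent's key, a dot, and its own name. The `Node` class, the `flatten` method, and the example names are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

class CompositeSketch {

  /** Hypothetical stand-in for a parameter that may contain further parameters. */
  static final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name, Node... children) {
      this.name = name;
      for (Node child : children) {
        this.children.add(child);
      }
    }
  }

  /** Collects the dotted key of the given node and of every node below it, depth first. */
  static List<String> flatten(String prefix, Node node) {
    List<String> keys = new ArrayList<>();
    String key = prefix + node.name;
    keys.add(key);
    for (Node child : node.children) {
      // Nested settings live under the parent's key followed by a dot.
      keys.addAll(flatten(key + ".", child));
    }
    return keys;
  }

  public static void main(String[] args) {
    Node distanceMeasure =
        new Node("distanceMeasure", new Node("weightsFile"), new Node("vectorWritableClass"));
    for (String key : flatten("Driver.", distanceMeasure)) {
      System.out.println(key);
    }
  }
}
```

Swapping in a different distance measure class would then only change the subtree under `distanceMeasure`, leaving the sibling keys untouched.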

        Karl Wettin added a comment - - edited
        • Parametered
        • Parameter
        • ParameterUtils - prints settings to different output formats
        • AbstractParameter
        • CompositeParameter - a class of some sort that can contain more parameters.
        • PathParameter
        • DoubleParameter, IntegerParameter, StringParameter

        This is what implementations can look like:

        public class HierarchialClusterDriver implements Tool, Parametered {
        
          private void help() {
            System.out.println("Usage: ./hadoop mahout.job HierarchialClusterDriver [default root path]");
            System.out.println();
            System.out.println(ParameterUtils.help(this));
          }
        
          public static void main(String[] args) throws Exception {
            HierarchialClusterDriver driver = new HierarchialClusterDriver();
            driver.configureParameters(HierarchialClusterDriver.class.getSimpleName() + ".");
            int res = ToolRunner.run(new Configuration(), driver, args);
            System.exit(res);
        
          }
        
          private List<Parameter> parameters;
          private Parameter<Path> defaultRootPath;
          private Parameter<Path> trainingInstancesPath;
          private Parameter<Path> trainingInstancePath;
          private Parameter<Path> treePath;
          private Parameter<Path> treeInstancesPath;
          private Parameter<Double> instanceDistancePruneThreadshold;
          private Parameter<DistanceMeasure> distanceMeasure;
          private Parameter<Path> closestInstanceOutputPath;
          private Parameter<Path> closestNodeOutputPath;
          private Parameter<Path> nodesOutputFile;
          protected Parameter<String> trainingInstanceVectorWritableClass;
        
          public List<Parameter> getParameters() {
            return parameters;
          }
        
          public void configureParameters(String prefix) {
            parameters = new ArrayList<Parameter>(1);
            parameters.add(defaultRootPath = new PathParameter(new Path("/" + HierarchialClusterDriver.class.getSimpleName()), prefix + "defaultRootPath", "Path to where all files are located by default."));
        
            parameters.add(trainingInstancesPath = new PathParameter(new Path(defaultRootPath.get(), "trainingInstances"), prefix + "trainingInstancesPath", "Path to file containing instances to be added to the tree."));
            parameters.add(trainingInstancePath = new PathParameter(new Path(defaultRootPath.get(), "trainingInstance"), prefix + "trainingInstancePath", "Path to temporay file containing the instance to be inserted and currently compared against instances in the tree."));
            parameters.add(trainingInstanceVectorWritableClass = new StringParameter(DenseVector.class.getName(), prefix + "trainingInstanceVectorWritableClass", "VectorWritable class used to read and write trainingInstances and trainingInstance file."));
        
            parameters.add(distanceMeasure = new CompositeParameter<DistanceMeasure>(DistanceMeasure.class, new WeightedEuclideanDistanceMeasure(), prefix + "distanceMeasure", "Distance measure used to calcualte distance between instances."));
            parameters.add(treePath = new PathParameter(new Path(defaultRootPath.get(), "tree"), prefix + "treePath", "Path to directory containing persistent tree."));
            parameters.add(treeInstancesPath = new PathParameter(new Path(treePath.get(), "instances"), prefix + "treeInstancesPath", "Path to temporay file containing all instances currently in tree."));
            parameters.add(instanceDistancePruneThreadshold = new DoubleParameter(0.2d, prefix + "instanceDistancePruneThreadshold", "Instances will share the same leaf node if the distance between them is no more than this value."));
        
            parameters.add(closestInstanceOutputPath = new PathParameter(new Path(defaultRootPath.get(), "closest_instance"), prefix + "closestInstanceOutputPath", "Path to temporay results directoy containing closest instance in tree."));
            parameters.add(closestNodeOutputPath = new PathParameter(new Path(defaultRootPath.get(), "closest_node"), prefix + "closestNodeOutputPath", "Path to temporay results directoy containing closest node in tree.."));
        
            parameters.add(nodesOutputFile = new PathParameter(new Path(defaultRootPath.get(), "nodes_from_leaf_to_root"), prefix + "nodesOutputFile", "Path to temporay file containing nodes between instance and root to be compared against training instance."));
        
            for (Parameter parameter : parameters) {
              parameter.configureParameters(parameter.name() + ".");
            }
          }
        

        Notice that the driver has a Parameter<DistanceMeasure>.

        public abstract class WeightedDistanceMeasure extends AbstractDistanceMeasure {
          protected List<Parameter> parameters;
          protected Parameter<String> weightsFile;
          protected Parameter<String> vectorWritableClass;
          protected Vector weights;
        
        
          public void configureParameters(String prefix) {
            parameters = new ArrayList<Parameter>(2);
            parameters.add(weightsFile = new StringParameter(null, prefix + "weightsFile", "Path on DFS to a file containing the weights."));
            parameters.add(vectorWritableClass = new StringParameter(DenseVector.class.getName(), prefix + "vectorWritableClass", "VectorWritable class used to read file specified in parameter weightsFile."));
          }
        
          public Collection<Parameter> getParameters() {
            return parameters;
          }
        
          public void configure(JobConf jobConf) {
            if (parameters == null) {
              configureParameters(WeightedDistanceMeasure.class.getName() + ".");
            }
            try {
              FileSystem fs = FileSystem.get(jobConf);
              if (weightsFile.get() != null) {
                VectorWritable writable = (VectorWritable) Class.forName(vectorWritableClass.get()).newInstance();
        

        Here is the output from ParameterUtils.help(driver):

        Usage: ./hadoop mahout.job HierarchialClusterDriver [root path]
        
        HierarchialClusterDriver.defaultRootPath                        Path to where all files are located by default. (default value '/HierarchialClusterDriver')
        HierarchialClusterDriver.trainingInstancesPath                  Path to file containing instances to be added to the tree. (default value '/HierarchialClusterDriver/trainingInstances')
        HierarchialClusterDriver.trainingInstancePath                   Path to temporay file containing the instance to be inserted and currently compared against instances in the tree. (default value '/HierarchialClusterDriver/trainingInstance')
        HierarchialClusterDriver.trainingInstanceVectorWritableClass    VectorWritable class used to read and write trainingInstances and trainingInstance file. (default value 'org.apache.mahout.matrix.DenseVector')
        HierarchialClusterDriver.distanceMeasure                        Distance measure used to calcualte distance between instances. (default value 'org.apache.mahout.utils.WeightedEuclideanDistanceMeasure')
        

        The next two lines are the composite parts of the parameter on the previous line.

        HierarchialClusterDriver.distanceMeasure.weightsFile            Path on DFS to a file containing the weights.
        HierarchialClusterDriver.distanceMeasure.vectorWritableClass    VectorWritable class used to read file specified in parameter weightsFile. (default value 'org.apache.mahout.matrix.DenseVector')
        HierarchialClusterDriver.treePath                               Path to directory containing persistent tree. (default value '/HierarchialClusterDriver/tree')
        HierarchialClusterDriver.treeInstancesPath                      Path to temporay file containing all instances currently in tree. (default value '/HierarchialClusterDriver/tree/instances')
        HierarchialClusterDriver.instanceDistancePruneThreadshold       Instances will share the same leaf node if the distance between them is no more than this value. (default value '0.2')
        HierarchialClusterDriver.closestInstanceOutputPath              Path to temporay results directoy containing closest instance in tree. (default value '/HierarchialClusterDriver/closest_instance')
        HierarchialClusterDriver.closestNodeOutputPath                  Path to temporay results directoy containing closest node in tree.. (default value '/HierarchialClusterDriver/closest_node')
        HierarchialClusterDriver.nodesOutputFile                        Path to temporay file containing nodes between instance and root to be compared against training instance. (default value '/HierarchialClusterDriver/nodes_from_leaf_to_root')
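
A two-column listing like the one above could be produced roughly as follows. This is a sketch and not the patch's code: the `help` method, its map-based input, and the gap of four spaces are assumptions. It pads every name to the width of the longest one, so a long name cannot run into its description, and it leaves out the default suffix when there is no default (as with `weightsFile`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HelpFormatSketch {

  /**
   * Renders one line per entry: the name padded to a common column, the description,
   * and a "(default value '...')" suffix when a default is present.
   * Each map value is a two-element array {description, defaultOrNull}.
   */
  static String help(Map<String, String[]> parameters) {
    int width = 0;
    for (String name : parameters.keySet()) {
      width = Math.max(width, name.length());
    }
    width += 4; // gap between the two columns (arbitrary choice)

    StringBuilder out = new StringBuilder();
    for (Map.Entry<String, String[]> entry : parameters.entrySet()) {
      String description = entry.getValue()[0];
      String defaultValue = entry.getValue()[1];
      // "%-Ns" left-justifies the name in a field of N characters.
      out.append(String.format("%-" + width + "s%s", entry.getKey(), description));
      if (defaultValue != null) {
        out.append(" (default value '").append(defaultValue).append("')");
      }
      out.append('\n');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, String[]> parameters = new LinkedHashMap<>();
    parameters.put("Driver.treePath",
        new String[] {"Path to directory containing persistent tree.", "/Driver/tree"});
    parameters.put("Driver.distanceMeasure.weightsFile",
        new String[] {"Path on DFS to a file containing the weights.", null});
    System.out.print(help(parameters));
  }
}
```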
        

        And this is the output from ParameterUtils.conf(driver):

        # Path to where all files are located by default.
        HierarchialClusterDriver.defaultRootPath = /HierarchialClusterDriver
        
        # Path to file containing instances to be added to the tree.
        HierarchialClusterDriver.trainingInstancesPath = /HierarchialClusterDriver/trainingInstances
        
        # Path to temporay file containing the instance to be inserted and currently compared against instances in the tree.
        HierarchialClusterDriver.trainingInstancePath = /HierarchialClusterDriver/trainingInstance
        
        # VectorWritable class used to read and write trainingInstances and trainingInstance file.
        HierarchialClusterDriver.trainingInstanceVectorWritableClass = org.apache.mahout.matrix.DenseVector
        
        # Distance measure used to calcualte distance between instances.
        HierarchialClusterDriver.distanceMeasure = org.apache.mahout.utils.WeightedEuclideanDistanceMeasure
        
        # Path on DFS to a file containing the weights.
        HierarchialClusterDriver.distanceMeasure.weightsFile = 
        
        # VectorWritable class used to read file specified in parameter weightsFile.
        HierarchialClusterDriver.distanceMeasure.vectorWritableClass = org.apache.mahout.matrix.DenseVector
        
        # Path to directory containing persistent tree.
        HierarchialClusterDriver.treePath = /HierarchialClusterDriver/tree
        
        # Path to temporay file containing all instances currently in tree.
        HierarchialClusterDriver.treeInstancesPath = /HierarchialClusterDriver/tree/instances
        
        # Instances will share the same leaf node if the distance between them is no more than this value.
        HierarchialClusterDriver.instanceDistancePruneThreadshold = 0.2
        
        # Path to temporay results directoy containing closest instance in tree.
        HierarchialClusterDriver.closestInstanceOutputPath = /HierarchialClusterDriver/closest_instance
        
        # Path to temporay results directoy containing closest node in tree..
        HierarchialClusterDriver.closestNodeOutputPath = /HierarchialClusterDriver/closest_node
        
        # Path to temporay file containing nodes between instance and root to be compared against training instance.
        HierarchialClusterDriver.nodesOutputFile = /HierarchialClusterDriver/nodes_from_leaf_to_root
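
The `# comment` and `key = value` layout above happens to be the same shape that `java.util.Properties` reads, so a file in this format could plausibly be loaded back as sketched below. Whether the patch reads its files this way is not stated here; the class name `ConfRoundTripSketch` and the `parse` helper are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ConfRoundTripSketch {

  /** Parses text in the "# comment / key = value" layout into a Properties object. */
  static Properties parse(String conf) {
    Properties properties = new Properties();
    try {
      properties.load(new StringReader(conf));
    } catch (IOException e) {
      // Not expected when reading from an in-memory string.
      throw new UncheckedIOException(e);
    }
    return properties;
  }

  public static void main(String[] args) {
    String conf =
        "# Path to directory containing persistent tree.\n"
            + "HierarchialClusterDriver.treePath = /HierarchialClusterDriver/tree\n"
            + "\n"
            + "# Path on DFS to a file containing the weights.\n"
            + "HierarchialClusterDriver.distanceMeasure.weightsFile = \n";
    Properties properties = parse(conf);
    // Whitespace around '=' is dropped; a key with nothing after '=' maps to the empty string.
    System.out.println(properties.getProperty("HierarchialClusterDriver.treePath"));
    System.out.println(
        properties.getProperty("HierarchialClusterDriver.distanceMeasure.weightsFile").isEmpty());
  }
}
```

One caveat of this approach is that `Properties` treats a backslash as an escape character, which would matter for any value that contains one.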
        
        Karl Wettin added a comment -

        Patch contains a pattern to recursively configure composite JobConfigurable classes.

        public interface Parametered extends JobConfigurable {
          public static final Log log = LogFactory.getLog(Parametered.class);
          public abstract Collection<Parameter> getParameters();
          public abstract void createParameters(String prefix, JobConf jobConf);
        }
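
The recursion can be sketched without Hadoop by letting a plain `Map<String, String>` stand in for `JobConf`. Everything below (the `Component` interface, the `configure` helper, the `Measure` and `Driver` classes, and the default values) is a hypothetical simplification and not the patch's API: each component resolves its own settings under `prefix`, and then each nested component is configured under `prefix + name + "."`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RecursiveConfigSketch {

  /** Hypothetical stand-in for a Parametered class; a Map replaces JobConf. */
  interface Component {
    /** Reads this component's own settings from conf, using keys that start with prefix. */
    void createParameters(String prefix, Map<String, String> conf);

    /** Nested components keyed by parameter name; empty for non-composite components. */
    Map<String, Component> children();
  }

  /** Configures a component, then every nested component under an extended prefix. */
  static void configure(Component component, String prefix, Map<String, String> conf) {
    component.createParameters(prefix, conf);
    for (Map.Entry<String, Component> child : component.children().entrySet()) {
      configure(child.getValue(), prefix + child.getKey() + ".", conf);
    }
  }

  /** Leaf component with a single setting and a placeholder default. */
  static final class Measure implements Component {
    String weightsFile;

    public void createParameters(String prefix, Map<String, String> conf) {
      weightsFile = conf.getOrDefault(prefix + "weightsFile", "none");
    }

    public Map<String, Component> children() {
      return new LinkedHashMap<>();
    }
  }

  /** Composite component that owns a Measure under the name "distanceMeasure". */
  static final class Driver implements Component {
    final Measure measure = new Measure();
    double threshold;

    public void createParameters(String prefix, Map<String, String> conf) {
      threshold = Double.parseDouble(conf.getOrDefault(prefix + "threshold", "0.2"));
    }

    public Map<String, Component> children() {
      Map<String, Component> children = new LinkedHashMap<>();
      children.put("distanceMeasure", measure);
      return children;
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("Driver.distanceMeasure.weightsFile", "/weights");
    Driver driver = new Driver();
    configure(driver, "Driver.", conf);
    System.out.println(driver.threshold + " " + driver.measure.weightsFile);
  }
}
```

Settings that are absent from the map fall back to the default passed at lookup time, which mirrors the default-value argument in the parameter constructors.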
        

        Simple non-composite use:

        public abstract class WeightedDistanceMeasure extends AbstractDistanceMeasure {
        
          protected List<Parameter> parameters;
          protected Parameter<Path> weightsFile;
          protected Parameter<Class> vectorWritableClass;
          protected Vector weights;
        
        
          public void createParameters(String prefix, JobConf jobConf) {
            parameters = new ArrayList<Parameter>();
            parameters.add(weightsFile = new PathParameter(prefix, "weightsFile", jobConf, null, "Path on DFS to a file containing the weights."));
            parameters.add(vectorWritableClass = new ClassParameter(prefix, "vectorWritableClass", jobConf, DenseVector.class, "Class<Vector> file specified in parameter weightsFile has been serialized with."));
          }
        
          public Collection<Parameter> getParameters() {
            return parameters;
          }
        
          public void configure(JobConf jobConf) {
            if (parameters == null) {
              ParameteredGeneralizations.configureParameters(this, jobConf);
            }    
        

        HierarchialClusterDriver contains a whole bunch of parameters of different sorts. The tree parameter contains more parameters. See the help output further down.

          private List<Parameter> parameters;
          private Parameter<Path> dfsRootPath;
          private Parameter<Path> trainingInstancesFile;
          private Parameter<Path> trainingInstanceFile;
          private Parameter<Path> appendingInstancesFile;
          private Parameter<Tree> tree;
          private Parameter<Double> instanceDistancePruneThreadshold;
          private Parameter<Path> closestInstanceOutputPath;
          private Parameter<Path> closestNodeOutputPath;
          private Parameter<Path> nodesOutputFile;
          private Parameter<Class> trainingInstanceVectorClass;
        
        
          public List<Parameter> getParameters() {
            return parameters;
          }
        
          public void createParameters(String prefix, JobConf jobConf) {
            parameters = new ArrayList<Parameter>();
            parameters.add(dfsRootPath = new PathParameter(prefix, "dfsRootPath", jobConf, new Path("/" + HierarchialClusterDriver.class.getSimpleName()), "Path to where all files on DFS are located by default."));
            
            parameters.add(trainingInstancesFile = new PathParameter(prefix, "trainingInstancesFile", jobConf, new Path(dfsRootPath.get(), "trainingInstances"), "Path to file containing instances to be added to the tree."));
            parameters.add(trainingInstanceFile = new PathParameter(prefix, "trainingInstanceFile", jobConf, new Path(dfsRootPath.get(), "trainingInstance"), "Path to temporay file containing the instance to be inserted and currently compared against instances in the tree."));
            parameters.add(trainingInstanceVectorClass = new ClassParameter(prefix, "trainingInstanceVectorClass", jobConf, DenseVector.class, "Class<Vector> used to read and write trainingInstances and trainingInstance file."));
        
            parameters.add(appendingInstancesFile = new PathParameter(prefix, "appendingInstancesFile", jobConf, new Path(dfsRootPath.get(), "appendingInstances"), "Path to temporay file containing all the instances to be measured against when inserting a new instance."));
        
            parameters.add(tree = new CompositeParameter<Tree>(Tree.class, prefix, "tree", jobConf, new PhmTree(), "Class<Tree> used to store instances and their 2-dimensional relationships."));
            parameters.add(instanceDistancePruneThreadshold = new DoubleParameter(prefix, "instanceDistancePruneThreadshold", jobConf, 0.2d, "Instances will share the same leaf node if the distance between them is no more than this value."));
        
            parameters.add(closestInstanceOutputPath = new PathParameter(prefix, "closestInstanceOutputPath", jobConf, new Path(dfsRootPath.get(), "closest_instance"), "Path to temporay results directoy containing closest instance in tree."));
            parameters.add(closestNodeOutputPath = new PathParameter(prefix, "closestNodeOutputPath", jobConf, new Path(dfsRootPath.get(), "closest_node"), "Path to temporay results directoy containing closest node in tree.."));
        
            parameters.add(nodesOutputFile = new PathParameter(prefix, "nodesOutputFile", jobConf, new Path(dfsRootPath.get(), "nodes_from_leaf_to_root"), "Path to temporay file containing nodes between instance and root to be compared against training instance."));
        
          }
        
          private Configuration conf;
        
          public void configure(JobConf jobConf) {
            if (parameters == null) {
              ParameteredGeneralizations.configureParameters(this, jobConf);
            }
          }
        

        And this is what the help output looks like (a little buggy):

        Usage: ./hadoop mahout.job HierarchialClusterDriver [default dfs root path]
        
        dfsRootPath                             Path to where all files on DFS are located by default. (default value '/HierarchialClusterDriver')
        trainingInstancesFile                   Path to file containing instances to be added to the tree. (default value 'HierarchialClusterDriver/1208901248028/dfs/trainingInstances')
        trainingInstanceFile                    Path to temporay file containing the instance to be inserted and currently compared against instances in the tree. (default value 'HierarchialClusterDriver/1208901248028/dfs/trainingInstance')
        trainingInstanceVectorClass             Class<Vector> used to read and write trainingInstances and trainingInstance file. (default value 'org.apache.mahout.matrix.DenseVector')
        appendingInstancesFile                  Path to temporay file containing all the instances to be measured against when inserting a new instance. (default value 'HierarchialClusterDriver/1208901248028/dfs/appendingInstances')
        tree                                    Class<Tree> used to store instances and their 2-dimensional relationships. (default value 'org.apache.mahout.clustering.hierarchial.tree.phm.PhmTree')
        tree.distanceMeasure                    Class<DistanceMeasure> used to measure distance between instances. (default value 'org.apache.mahout.utils.WeightedEuclideanDistanceMeasure')
        tree.distanceMeasure.weightsFile        Path on DFS to a file containing the weights.
        tree.distanceMeasure.vectorWritableClassClass<Vector> file specified in parameter weightsFile has been serialized with. (default value 'org.apache.mahout.matrix.DenseVector')
        tree.lfsRootPath                        Path to storage root path on local file system. (default value 'PhmTree')
        tree.sequenceManagerFile                Path to primary key sequence manager file on local file system. (default value '/Users/kalle/projekt/apache/mahout/MAHOUT-19/HierarchialClusterDriver/1208901248028/lfs/tree/sequenceManagerFile')
        tree.instanceVectorClass                Class<Vector> used to serialize instances. (default value 'org.apache.mahout.matrix.SparseVector')
        tree.maximumInstances                   Maximum number of instances this tree can fit. (default value '10000')
        instanceDistancePruneThreadshold        Instances will share the same leaf node if the distance between them is no more than this value. (default value '0.2')
        closestInstanceOutputPath               Path to temporay results directoy containing closest instance in tree. (default value 'HierarchialClusterDriver/1208901248028/dfs/closest_instance')
        closestNodeOutputPath                   Path to temporay results directoy containing closest node in tree.. (default value 'HierarchialClusterDriver/1208901248028/dfs/closest_node')
        nodesOutputFile                         Path to temporay file containing nodes between instance and root to be compared against training instance. (default value 'HierarchialClusterDriver/1208901248028/dfs/nodes_from_leaf_to_root')
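
One visible glitch in the output above is that tree.distanceMeasure.vectorWritableClass runs straight into its description. That name is exactly 40 characters long, which is consistent with a fixed 40-character name column, although the patch's formatting code is not shown here. A minimal sketch of one way to avoid it is to size the column from the longest name plus a gap. HelpSketch and format are hypothetical names, not part of the patch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the MAHOUT-49 patch.
class HelpSketch {

  /** Formats name/description pairs with a name column sized to the longest name. */
  static String format(Map<String, String> descriptions) {
    int width = 0;
    for (String name : descriptions.keySet()) {
      width = Math.max(width, name.length());
    }
    width += 2; // always leave at least two spaces before the description

    StringBuilder out = new StringBuilder();
    for (Map.Entry<String, String> entry : descriptions.entrySet()) {
      // %-Ns left-aligns the name and pads it with spaces to N characters.
      out.append(String.format("%-" + width + "s%s%n", entry.getKey(), entry.getValue()));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, String> descriptions = new LinkedHashMap<>();
    descriptions.put("dfsRootPath", "Path to where all files on DFS are located by default.");
    descriptions.put("tree.distanceMeasure.vectorWritableClass",
        "Class<Vector> file specified in parameter weightsFile has been serialized with.");
    System.out.print(format(descriptions));
  }
}
```

With this approach every description starts in the same column regardless of how deeply a parameter is nested.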
        

        Any of these values can be read from the main method's String[] args or from the configuration file. Each parameter is also accessible via a getter and a setter, either as a typed object or as a string value:

        public interface Parameter<T> extends Parametered {
          /** @return job configuration setting key prefix */
          public abstract String prefix();
          /** @return configuration parameter name, e.g. org.apache.mahout.util.WeightedDistanceMeasure.weightsFile */
          public abstract String name();
          /** @return human readable description of the parameter */
          public abstract String description();
          /** @return value class type */
          public abstract Class<T> type();
          /** @param stringValue string representation of the new value */
          public abstract void setStringValue(String stringValue);
          /** @return string representation of the current value */
          public abstract String getStringValue();
          /** @param value new parameter value */
          public abstract void set(T value);
          /** @return current parameter value */
          public abstract T get();
          /** @return value used if not set by consumer */
          public abstract String defaultValue();
        }
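
To illustrate the dual typed/string access described above, here is a minimal, self-contained sketch of a double-valued parameter. It deliberately avoids Hadoop's JobConf and the patch's Parametered interface. SimpleParameter and SimpleDoubleParameter are hypothetical names and only approximate what the patch's DoubleParameter might do:

```java
// Hypothetical simplification; the real DoubleParameter in the patch also takes a JobConf.
class ParameterSketch {

  /** Trimmed-down version of the Parameter interface, without the Hadoop dependency. */
  interface SimpleParameter<T> {
    String name();
    Class<T> type();
    void setStringValue(String stringValue);
    String getStringValue();
    void set(T value);
    T get();
  }

  /** Holds a Double and converts to and from its string form. */
  static class SimpleDoubleParameter implements SimpleParameter<Double> {
    private final String name;
    private Double value;

    SimpleDoubleParameter(String prefix, String name, double defaultValue) {
      this.name = prefix + name;
      this.value = defaultValue; // start out with the default value
    }

    public String name() { return name; }

    public Class<Double> type() { return Double.class; }

    // String form, e.g. as read from main args or a configuration file.
    public void setStringValue(String stringValue) { value = Double.valueOf(stringValue); }

    public String getStringValue() { return value.toString(); }

    // Typed form, for callers that already hold a Double.
    public void set(Double value) { this.value = value; }

    public Double get() { return value; }
  }

  public static void main(String[] args) {
    SimpleDoubleParameter threshold =
        new SimpleDoubleParameter("", "instanceDistancePruneThreadshold", 0.2d);
    threshold.setStringValue("0.35");
    System.out.println(threshold.name() + " = " + threshold.get());
  }
}
```

Keeping both access paths on one object means a generic caller, such as a reflection-based GUI, can work purely with strings while application code uses the typed getter.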
        

          People

          • Assignee:
            Karl Wettin
            Reporter:
            Karl Wettin
          • Votes:
            0
            Watchers:
            0
