Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels:

      Description

      Top Down Clustering works in multiple steps. The first step is to find comparatively bigger clusters. The second step is to cluster the bigger chunks into meaningful clusters. This can improve performance when clustering large amounts of data. It also removes the dependency on providing input clusters/cluster counts to the clustering algorithm.

      The "big" is a relative term, as well as the smaller "meaningful" terms. So, the control of this "bigger" and "smaller/meaningful" clusters will be controlled by the user.

      Which clustering algorithm to use at the top level and which to use at the bottom level can also be selected by the user. Initially, this can be done for only one or a few clustering algorithms; later, an option can be provided to use all the algorithms that suit the case.
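
      As a rough illustration of the intended two-step flow (a sketch only; the helper names below are hypothetical and not from any patch):

          // Step 1: find comparatively bigger clusters.
          Path input = new Path("input/vectors");
          Path topOutput = new Path("output/topLevelCluster");
          runTopLevelClustering(input, topOutput, topLevelConfig);

          // Step 2: cluster each bigger chunk into meaningful clusters,
          // one bottom-level run per top-level cluster directory.
          for (Path clusterDir : listClusterDirectories(topOutput)) {
            runBottomLevelClustering(clusterDir, bottomLevelConfig);
          }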

      1. MAHOUT-843-patch
        68 kB
        Paritosh Ranjan
      2. MAHOUT-843-patch-only-postprocessor
        27 kB
        Paritosh Ranjan
      3. MAHOUT-843-patch-only-postprocessor-final
        55 kB
        Paritosh Ranjan
      4. MAHOUT-843-patch-only-postprocessor-v1
        29 kB
        Paritosh Ranjan
      5. MAHOUT-843-patch-only-postprocessor-v2
        29 kB
        Paritosh Ranjan
      6. MAHOUT-843-patch-only-postprocessor-v3
        42 kB
        Paritosh Ranjan
      7. MAHOUT-843-patch-only-postprocessor-v4
        47 kB
        Paritosh Ranjan
      8. MAHOUT-843-patch-only-postprocessor-v5
        54 kB
        Paritosh Ranjan
      9. MAHOUT-843-patch-v1
        82 kB
        Paritosh Ranjan
      10. Top-Down-Clustering-patch
        19 kB
        Paritosh Ranjan

        Activity

        Paritosh Ranjan added a comment -

        I am trying to implement top down clustering. I have read the concept from the book Mahout in Action. This patch is just for getting feedback on the line of thought for implementing it.

        I think that this Top Down Clustering should be flexible: the user should be able to use different clustering algorithms for the first and second levels of clustering, with parameters that suit the user.

        The patch demonstrates the idea. What is left to code, in the patch, is to arrange the clustered output from the first-level clustering algorithm in a directory structure and provide each directory (of clustered points) to the second-level clustering algorithm.

        Please don't consider this patch as the final patch. I am submitting this for feedback, and would welcome suggestions to improve it.

        Hide
        Paritosh Ranjan added a comment -

        After doing the top level clustering, the output is of the form "clusterId, vectorId". The problem is that the bottom level clustering would need its input as a directory of points. So, the points belonging to different clusters should be in different directories.

        This can be done as a post-processing step (after runClustering). Or it can be done in the MapReduce step, if it is already known that this is a top-down clustering.

        The MapReduce approach will need some change in every clustering algorithm, but it will give better performance. The post-processing approach will not touch any clustering algorithm; it will just be an extra step.

        To start with, I am beginning with the post-processing step, as this will make this a completely clean patch which cannot cause any regression.

        Any ideas/suggestions on how to approach this problem?
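
        For concreteness, a minimal sketch of the post-processing idea, assuming the clustered output is a SequenceFile of (IntWritable clusterId, WeightedVectorWritable point) pairs as the Mahout cluster drivers write to clusteredPoints; this is not code from any attached patch:

          import java.util.HashMap;
          import java.util.Map;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.mahout.clustering.WeightedVectorWritable;

          class ClusterOutputSplitter {
            static void splitByCluster(Path clusteredPoints, Path postProcessedDir) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              // One lazily created writer per cluster id, reused for every point of that cluster.
              Map<Integer, SequenceFile.Writer> writers = new HashMap<Integer, SequenceFile.Writer>();
              SequenceFile.Reader reader = new SequenceFile.Reader(fs, clusteredPoints, conf);
              IntWritable clusterId = new IntWritable();
              WeightedVectorWritable point = new WeightedVectorWritable();
              while (reader.next(clusterId, point)) {
                SequenceFile.Writer writer = writers.get(clusterId.get());
                if (writer == null) {
                  // One output directory per cluster.
                  Path clusterDir = new Path(postProcessedDir, String.valueOf(clusterId.get()));
                  writer = SequenceFile.createWriter(fs, conf, new Path(clusterDir, "part-m-0"),
                      IntWritable.class, WeightedVectorWritable.class);
                  writers.put(clusterId.get(), writer);
                }
                writer.append(clusterId, point);
              }
              reader.close();
              for (SequenceFile.Writer writer : writers.values()) {
                writer.close();
              }
            }
          }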

        Hide
        Jeff Eastman added a comment -

        I can't get the patch to do anything. It runs ok, but does not add any of the files. I'm left reading over the original patch file in a browser, which is not that great.

        What I get from looking at the patch is you are building a ClusterConfig for the top level clustering step and also for the bottom level clustering step. These capture the various parameters for each clustering algorithm. Then, the driver gets an executor from each config that knows how to invoke the top and bottom clustering steps. On the surface, that seems to be a workable approach.

        All this is pure Java though and there is no CLI interface. This seems like the really challenging part, as each of the clustering configs will need a complete set of CLI arguments (e.g. /bin/mahout topdownclustering <top configs> <bottom configs>). For any given combination of top/bottom configs you are going to need a different CLI argument list. Since top and bottom may be the same algorithm but with different parameters (e.g. t1-top, t1-bottom), or different algorithms with overlapping argument names (e.g. dm-top, dm-bottom), I can't think of a good way to approach this, can you?

        Isn't this something that could also be done with a set of shell scripts?

        Hide
        Jeff Eastman added a comment -

        Figured out why the patch doesn't do what I expected: it was made from inside mahout/core rather than from the top-level mahout directory, as is customary. The files were added, but not in the source tree where they show up with svn st. Looking at the code now in Eclipse, my original comments still apply.

        Hide
        Jeff Eastman added a comment -

        +1 for postprocessing the clustered points into different input directories rather than patching all the clustering algorithms. It will be interesting to see how the rest of the bottom code for runClustering develops, as only the top clustering is implemented.

        Hide
        Paritosh Ranjan added a comment -

        I did not know that. I will do it from the parent directory from now on. Thanks for letting me know.

        Yes, the code does not work yet because I still need to group points belonging to the same cluster in their respective directories, and give each cluster directory as the input to the bottom level clustering. I am working on that part and will upload a working patch soon.

        One option would be to let the user run the two levels manually:

        /bin/mahout toplevelclustering <cluster-config>
        /bin/mahout bottomlevelclustering <cluster-config>

        Then we get rid of the duplicate-looking arguments, as the only difference would be the input directory of the bottom level clustering, which can be derived based on whether it is a top level or a bottom level clustering (as the output directory of the top level clustering will be controlled by the code).

        Hide
        Jeff Eastman added a comment - edited

        Ok, but how is this better than using the existing jobs?

        bin/mahout canopy <...>
        bin/mahout separateclusters <...>   # your new postprocessor job
        for i in <separateclusters output>; do
          bin/mahout kmeans <...>
        done

        There is a bit of global argument coupling needed between the -o of canopy and the -i of separateclusters, also between the -o of separateclusters and the -i of kmeans. Are there any other argument couplings? It seems to me that the new postprocessor would be a useful addition to Mahout and the user would then be responsible for wiring the three steps together. Can you think of a way to improve upon this?

        This problem also seems to be similar to the Mahout recommender code: all done by plugging existing Java classes together in a program which is specific for each problem domain/application. There is no CLI for doing this because of the same sorts of issues we are discussing here.

        Hide
        Paritosh Ranjan added a comment - edited

        This patch implements TopDownClustering. The class to use is @TopDownClusteringDriver.

        Top Level Clustering can be done by implementations of @TopLevelClusterConfig and bottom level clustering can be done by all implementations of @BottomLevelClusterConfig, which are marker interfaces.

        The concept is to use different implementations of @ClusterConfig to specify the parameters of different clustering algorithms. These @ClusterConfig implementations are passed as parameters specifying the top level clustering configuration and the bottom level clustering configuration.

        The top level clustering output is post processed using @TopLevelClusterOutputPostProcessor which groups the vectors of similar clusters together. All of these clusters are further processed by bottom level clustering.

        There is a specific implementation of @ClusterExecutor associated with each implementation of @ClusterConfig which uses the cluster config parameters to execute the specific algorithm.

        The output of top level clustering is kept in <output path>/topLevelCluster and the output of bottom level clustering is kept in <output path>/bottomLevelCluster.

        The post processed output of top level cluster is kept in <output path>/topLevelCluster/topLevelClusterPostProcessed/clusterId.

        Both the top and bottom level cluster use the clusterId as the name of the clusters produced.

        I have added javadocs wherever I felt it necessary, which should also help guide you through the code. I have tested using @CanopyClusterConfig as both the top and bottom level cluster config and it works. The other configs should work out of the box.
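
        To make the shape of that API concrete, a hypothetical usage sketch (the type names are the ones mentioned above; the constructor and run() arguments are assumed, since the patch's actual signatures are not spelled out here):

          // Hypothetical wiring; the argument lists are placeholders, not the patch's real signatures.
          TopLevelClusterConfig topConfig = new CanopyClusterConfig(/* measure, t1, t2, ... */);
          BottomLevelClusterConfig bottomConfig = new CanopyClusterConfig(/* measure, t1, t2, ... */);
          TopDownClusteringDriver driver = new TopDownClusteringDriver();
          driver.run(input, output, topConfig, bottomConfig);
          // Results land under <output path>/topLevelCluster and <output path>/bottomLevelCluster.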

        Hide
        Jeff Eastman added a comment -

        This patch looks like a refinement of the earlier patch. Writing a Java driver to orchestrate top-down clustering given the Config and Postprocessor instances seems a useful experiment. What is needed to move this patch closer to trunk is: 1) some unit tests of the Java classes, 2) a command line interface. This last requirement is where I get back to my earlier question above: "how is this better than using the existing [CLI] jobs [in a shell script]?"

        To use the Java classes for top clusterer A and bottom clusterer B one needs to provide all of the arguments for A and B. Given all the different flavors of A and B which could be chosen, it still seems really complicated to define a single CLI which can provide all the permutations. Do you have a strategy for this?

        I do think the postprocessor to split the clusteredPointsA into directories so that multiple invocations of B can proceed is useful and I would suggest focusing on that as a stand-alone CLI method first. This would be a minimal first step and save the combinatoric explosion of A,B CLI arguments needed to encapsulate the whole process. With some unit tests and an example script or two, I could see that in trunk very soon.

        Hide
        Paritosh Ranjan added a comment -

        To answer "To use the Java classes for top clusterer A and bottom clusterer B one needs to provide all of the arguments for A and B": I would say that that's needed, as the user gets absolute control over the top and bottom level clustering.

        An alternative, to let it be used in a very simple way, would be to ask only for the bottom level cluster configs plus a magnitude parameter, which would scale the bottom level clustering algorithm's parameters by that factor, e.g. t1 = t1*10, or vice versa.

        I will add the JUnit test cases soon.

        Hide
        Jeff Eastman added a comment -

        You've not said how you plan to do a CLI for this, or why the Java classes are an improvement over using the existing CLI calls in a script.

        I don't think a multiplier approach will be workable. In general you need to provide all the A and B arguments, for every combination of A and B including A = B. I believe by instantiating the requisite config and executor classes you can make it work, but the CLI is a big part of the Mahout API.

        Bring on the unit tests.

        Hide
        Paritosh Ranjan added a comment -

        I use the Java API and not the CLIs, so I don't have much idea about that. I would appreciate it if someone more familiar with CLIs could create it.
        If you still feel that the patch can be useful, then I can add the JUnit tests and submit it.

        Hide
        Jeff Eastman added a comment -

        Well, passing off the hardest part of the problem to "someone more familiar" is guaranteed to make this patch sit in limbo. Right now the Java class approach is an interesting experiment. To complete the feature submission, you really need to address the CLI too. Or at least find that somebody to help you get it done. A half-done feature won't get committed; we are actually moving to remove such features from trunk to get ready for a 1.0 release next year.

        You still have not said why you believe the Java approach is better than using the existing CLIs in a script with your postprocessor.

        FWIW, I think the postprocessor, by itself, with a CLI, JavaDocs, an example script and unit tests, is your best path to a submission which will pass muster.

        Hide
        Paritosh Ranjan added a comment -

        I don't see much of a "comparison" between a Java approach and a CLI. I see these two as separate means to perform the same task. I think Mahout provides both ways to accomplish most tasks. So, to me, this question is like "Why is KMeansDriver better than the CLI for doing KMeans?", which I think depends on the way the user wants to use it.

        So, I don't see the reason for questioning the Java API. It helps the user accomplish top down clustering, with different clustering algorithms, without getting into its intricacies.

        I also don't think that creating a CLI would be the hardest part of the feature. So, I can create the CLI with top/bottom parameters for TopDownClustering all together. Many parameters for running clustering are common to most of the algorithms, so it's not going to be that complicated. But does creating this CLI and writing the JUnit tests complete the feature?

        Writing a CLI which does clustering, post processing, and again clustering makes sense, as it helps reduce parameters. But still, why against the Java API?

        Hide
        Paritosh Ranjan added a comment -

        Jeff, I analyzed the CLI creation mechanism, and also the shell script mechanism. Both are achievable and there is not much difference in my view.

        The only advantage that I can think of for using top/bottom config parameters in the CLI is that it remains immune to the internal implementation of top down clustering. This makes it similar to the other clustering CLIs. If we use a shell script, then we write logic in it (iterate on top level clusters); if the internal implementation of the algorithm changes, the shell script will be affected.

        The advantage with a shell script is that it makes CLI creation easier, provided the user is writing the shell script. Otherwise, the shell script would also need arguments to define the top and bottom level clustering parameters. If the user writes the shell script, then we make life tough for him (a little). (Question: who will provide the shell script, Mahout or the user?)

        Another advantage of exposing the post processor as a CLI is that it enables Mahout to group points of similar clusters together after clustering, which is not available in Mahout yet.

        Based on this analysis, which approach do you suggest?

        Hide
        Jeff Eastman added a comment -

        Paritosh,
        Top-down Clustering involves running a top driver A which produces clustered points as output. Then a postprocessing job moves each of the k clusters' points into a separate folder, which is given as input to each of the k bottom clustering drivers B. Given the existence of the postprocessing job, a user can elect to code it entirely in Java or write a shell script to use the CLI for each of the three steps.

        Though your postprocessor is still sequential and will not scale to large datasets, I see creating a CLI for an M/R version of this as the smallest incremental change to Mahout which will facilitate top-down clustering. Since each choice of A and B clustering algorithms carries its own set of parameters, I see making an overall CLI to bundle the entire top-down process as problematic.

        I can see your approach in the pure Java implementation is creating config and executor classes which bundle up the top and bottom cluster driver parameters and then orchestrate the top-down clustering process. Then, in the middle, the postprocessor is run to set up the bottom clustering folders. This is not a complicated pattern for users to do manually: configure A and run it; run the postprocessor; then configure B and run it against each of the bottom level input directories.

        From a minimalist perspective, all we really need is a scalable postprocessor with a Java driver & CLI and an example shell script that shows how to do top-down with one particular set of A and B.

        Hide
        Paritosh Ranjan added a comment -

        Ok. I agree that implementing the post processor will be the smallest step which will make top down clustering work, though the user will have to manually code some part of it.

        If we see this post processor as the smallest step towards implementing top down clustering, and considering we are following incremental development (which I have guessed from your comments), can you tell me what we would need for full-fledged top down clustering, in incremental order?

        I have added the CLI to the post processor. The CLI asks for the output path given to the cluster driver and then post processes it. Is it ok?

        I will add some JUnit tests and submit the patch.

        Hide
        Paritosh Ranjan added a comment -

        Hi Jeff, I have added the patch which has the CLI and JUnit tests for PathDirectory and ClusterOutputPostProcessor. These two classes, along with TopDownClusteringPathConstants, make up the post processor.

        I have tried to mirror a test case from TestKMeansClustering and run the post processor on it. My dev env is having trouble running Mahout test cases (it's Windows; though I have Cygwin, it still creates problems). So, I would also appreciate your help in adding some meaningful assertions to the test cases.

        This is also my first attempt at creating the CLI, so please help if I did something wrong there.

        Thanks.

        Hide
        Jeff Eastman added a comment -

        I've downloaded and installed your latest patch and it mostly passed (1 hunk failed in src/conf/driver/classes.props). I tried running the ClusteredOutputPostProcessorTest and it failed with an IOException: wrong value class at ClusterOutputPostProcessor line 94.

        Looking at your unit test, I'd suggest simplifying it a lot:

        • Use the sequential version of Canopy to create your top clusteredPoints directory. It writes the same files as the mapreduce version and runs a lot faster during a build.
        • Skip the k-means step as it adds no value when testing the postprocessor. The canopy clusteredPoints are all you need.
        • Get your sequential version of postProcessor working and verify that the points output to the respective input directories for the bottom clustering are correct.
        • Run a bottom clustering canopy job if you want to prove you got the input file directories right in the previous step, but make it sequential too.
        • Delete the SpectralKMeans stuff. It uses an affinity matrix as input and not a list of input vectors. It also won't produce clusteredPoints like the other algos. I'd concentrate on Canopy, KMeans, FuzzyK, MeanShift and Dirichlet which all behave similarly.
        • Make a new small patch with just the postprocessor stuff in it.
        • Write a small shell script to invoke the canopy top, the postprocessor and the canopy bottom using the CLIs for both. Maybe have a couple of flavors using different top/bottom combinations.

        From a minimalist point of view, this would make a reasonable Mahout submission to enable hierarchical clustering.

        Hide
        Jeff Eastman added a comment - edited

        After completing the above, I'd recommend creating a mapreduce version of the postprocessor. If you have TBs of vectors to cluster you will be most unhappy with the performance of the sequential version. I have some ideas on how to do this (a rough sketch follows the list):

        • Mapper reads clusteredPoints output and emits each VectorWritable to its clusterId
        • Driver needs to set numReducers to be the number of clusters present in the clusteredPoints. You can compute this by reading the clusters-*-final directory and counting the clusters.
        • Each reducer will receive all the VW points for a single cluster and will output a part file with key=clusterId value= {VW points}
        • Subsequent driver code needs to move each part-r-xxx file into its own directory so the bottom clustering job can take that as input. This will likely be a whopping large file so make sure it is splittable (I think sequenceFiles are already).
        • Implement -xm option on your postprocessor driver so that it can run either sequentially or mapreduce. Both should produce the same results.
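
        As a rough illustration of that mapper/reducer pair (hypothetical class names; this assumes clusteredPoints holds (IntWritable clusterId, WeightedVectorWritable point) pairs and that cluster ids map one-per-reducer when numReducers >= number of clusters):

          import java.io.IOException;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.apache.hadoop.mapreduce.Reducer;
          import org.apache.mahout.clustering.WeightedVectorWritable;

          // Re-emit each point keyed by its cluster id so that all points of
          // one cluster meet in a single reducer.
          class ClusterSplitMapper
              extends Mapper<IntWritable, WeightedVectorWritable, IntWritable, WeightedVectorWritable> {
            @Override
            protected void map(IntWritable clusterId, WeightedVectorWritable point, Context ctx)
                throws IOException, InterruptedException {
              ctx.write(clusterId, point);
            }
          }

          // Each reducer writes a part file holding the points of (at most) one
          // cluster; the driver then moves each part file into its own directory
          // as input for the bottom-level clustering.
          class ClusterSplitReducer
              extends Reducer<IntWritable, WeightedVectorWritable, IntWritable, WeightedVectorWritable> {
            @Override
            protected void reduce(IntWritable clusterId, Iterable<WeightedVectorWritable> points,
                Context ctx) throws IOException, InterruptedException {
              for (WeightedVectorWritable point : points) {
                ctx.write(clusterId, point);
              }
            }
          }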
        Hide
        Paritosh Ranjan added a comment -

        Thanks for your inputs Jeff. I will try to provide the patch according to your suggestions soon.

        Hide
        Paritosh Ranjan added a comment - edited

        I have added the JUnit test as suggested by you. The output processor is running properly, which is also evident from the JUnit test.

        Regarding the bottom level clustering through Canopy clustering: through the JUnit test, I have found that CanopyDriver is reopening a SequenceFile.Writer on the clustered files. Since SequenceFile.Writer does not support appending data after the writer is reopened, the data is being overwritten there.

        This overwriting issue is present only in the sequential version of clusterData; the clusterDataSeq method overwrites it. I used the Java API on a Hadoop cluster, which uses clusterDataMR, and it worked fine.

        Hide
        Jeff Eastman added a comment -

        I applied the patch and had the same conflicts with src/conf/driver.classes.props. There remain a number of unresolved external references which make me unable to run the test. Suggest you need to add at least the TopDownClusteringPathConstants to the patch, maybe more.

        Can you be more explicit about what is happening in the sequential Canopy overwrite case?

        Suggest you need to do an svn up before making your next patch file to pick up all the other changes that have happened in the interim.

        Hide
        Paritosh Ranjan added a comment - edited

        I have taken all incoming changes and created the patch. I have also added TopDownClusteringPathConstants; I can't see any other external reference.

        clusterDataMR and clusterDataSeq both overwrite the points in clusteredPoints when the input provided has more than one path, which is the case in the input of the bottom level cluster.

        The test case does top level clustering and asserts the cluster output processor, both of which work fine. Then it asserts the bottom level clustering, which shows the problem: only one point is written (overwritten) per cluster. This can be seen while debugging clusterDataSeq.

        Hide
        Paritosh Ranjan added a comment -

        Made some changes in the test case. Please use this latest patch.

        Hide
        Paritosh Ranjan added a comment -

        Jeff, I was analyzing the idea of the MapReduce version of the PostProcessor. I have a query about the solution you have proposed.

        As you said: "Each reducer will receive all the VW points for a single cluster and will output a part file with key=clusterId value={VW points}".

        What will happen if the number of reducers available is less than the number of clusters? Then, I think, each reducer will get more than one cluster's vectors. Please correct me if I am wrong.

        Hide
        Jeff Eastman added a comment -

        You are correct. This is why the number of reducers needs to be >= the number of clusters. Note the postprocessor then needs to move each part file into its own directory for the bottom clustering to use. I think the bottom clusterer can be given a single (likely huge) part file but we need to make sure this can be split automatically by Hadoop or the large file needs to be broken up into multiple smaller files (by another MR job).

        Hide
        Paritosh Ranjan added a comment -

        The output will be incorrect if the number of reducers < number of clusters. I think the job should fail in this case. What do you suggest?

        I think the sequence files are already broken into different parts. But still, we can keep an eye on this.

        Meanwhile, did you get a chance to analyze the patch for the cluster output processor?

        Hide
        Jeff Eastman added a comment -

        Actually, I'd suggest loading the clusters-n-final clusters to count the number of reducers required, rather than making it an argument and then failing if it is wrong.
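
        A minimal sketch of that counting step, assuming the final clusters live under a clusters-*-final directory of SequenceFiles with one record per cluster (again, not code from the patch):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Writable;
          import org.apache.hadoop.util.ReflectionUtils;

          static int countFinalClusters(FileSystem fs, Path clusterOutput, Configuration conf)
              throws Exception {
            int numClusters = 0;
            for (FileStatus dir : fs.globStatus(new Path(clusterOutput, "clusters-*-final"))) {
              for (FileStatus part : fs.listStatus(dir.getPath())) {
                if (!part.getPath().getName().startsWith("part")) {
                  continue; // skip _logs and the like
                }
                SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                  numClusters++; // one record per cluster centroid
                }
                reader.close();
              }
            }
            return numClusters; // use as numReducers for the splitting job
          }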

        On the patch, it loads and the tests all run. I've not dug into it beyond that yet.

        Hide
        Jeff Eastman added a comment -

        I've had some time to poke around in the code and single-step through the test execution. It seems to work as advertised but I had a difficult time reading the code and I'm concerned that the implementation has some performance challenges. Much of this involves personal style so take it with a grain of salt:

        • The code is very fine-grained, with many small methods. Usually I'm faced with the opposite - huge methods that need to be decomposed - but here the smallness works against readability. Consider distributeVectors() which calls putVectorsInRespectiveClusters() which calls findClusterAndAddVector() which calls addVectorToItsClusterFile() which calls writeVectorToCluster(). These small methods are only called in a single place and could be inlined to provide more context for how the point is finally written.
        • The ClusterOutputPostProcessor has a number of private fields which are initialized and used within this method chain rather than passing needed values in as method arguments. If I'm stopped in the debugger as I am now, it is challenging for me to identify where and how a particular field was initialized because it is not visible in the stack frames of the call chain.
        • writeVectorToCluster() creates a SequenceFile.Writer for each point, then appends to it, syncs it, and closes it. The path for the output file is passed down two levels. Seems to me that will thrash the GC, and I'd consider passing the writer down instead of the path so you only open/append/sync/close one for each cluster.
        • Minor now, but the run method has one -i argument which is immediately assigned to clusterOutputToBeProcessed, and the inputWriter is actually writing output. Seems backwards. I'd suggest adding a -o argument so callers have full control over where the points come from and where they get written.
        • Finally, I'd also suggest adding a -ci argument so that the clusters can be read in the beginning. This will facilitate setting numReducers later and would allow you to create all the output writers at the beginning rather than lazily as they are encountered.

        All this aside however, it seems to work and is a good step forward. Once the mapreduce version gets fleshed out you will likely want to refactor the whole thing anyway.

        Hide
        Paritosh Ranjan added a comment -

        Hi Jeff,

        I have been working on the MapReduce version of the post processor, and the solution you proposed is shaping up. Looks like a really cool solution.

        I have a question regarding the calculation of the number of clusters. I think you want to count the number of centroids produced in the final directory. Am I correct? If so, is this final directory produced by all the algorithms we want to support, i.e. "Canopy, KMeans, FuzzyK, MeanShift and Dirichlet" as you mentioned? If I have a misconception, please help me clarify it.

        I was also looking into the refactoring that you suggested.
        - I will make the code more readable, which should also solve the method naming problem.
        - Instead of opening and closing a SequenceFile.Writer for every point, I can keep one writer per clusterId in a map and reuse it, then close all of them together at the end (see the sketch below). Will this solve the issue?
        - Regarding the private fields: they look better to me from a code point of view. They might be harder to debug, but I think passing too many values through method arguments does not make for clean code. I'd suggest keeping it the way it is now. What do you think?
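
        Something like this minimal sketch (plain Hadoop SequenceFile APIs; the class, field, and path names here are hypothetical, not from the patch):

        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Writable;

        public class ClusterWriterCache {

          private final Map<String, SequenceFile.Writer> writers = new HashMap<String, SequenceFile.Writer>();
          private final Configuration conf;
          private final FileSystem fs;
          private final Path outputBase; // hypothetical base output directory

          public ClusterWriterCache(Configuration conf, Path outputBase) throws IOException {
            this.conf = conf;
            this.fs = FileSystem.get(conf);
            this.outputBase = outputBase;
          }

          /** Appends one point to its cluster's file, opening the writer only once per cluster. */
          public void write(String clusterId, Writable key, Writable point) throws IOException {
            SequenceFile.Writer writer = writers.get(clusterId);
            if (writer == null) {
              Path clusterFile = new Path(outputBase, clusterId);
              writer = SequenceFile.createWriter(fs, conf, clusterFile, key.getClass(), point.getClass());
              writers.put(clusterId, writer);
            }
            writer.append(key, point);
          }

          /** Closes all writers once, after every point has been distributed. */
          public void closeAll() throws IOException {
            for (SequenceFile.Writer writer : writers.values()) {
              writer.close();
            }
            writers.clear();
          }
        }

        The point is that each cluster's file is opened once and closed once, instead of once per point.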

        Jeff Eastman added a comment -

        All the clustering algorithms you mention write their Clusters to clusters-n-final, so if you load them from that directory you will get the correct counts to set numReducers with.
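
        For example, counting could look roughly like this (a sketch using plain Hadoop SequenceFile APIs; the actual ClusterCountReader in the patch may differ):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.util.ReflectionUtils;

        public final class ClusterCounterSketch {

          private ClusterCounterSketch() {
          }

          /** Counts the records (one per cluster) in the part files under clusters-*-final. */
          public static int countClusters(Path output, Configuration conf) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            int count = 0;
            // Match the final clusters directory regardless of the iteration number n.
            FileStatus[] finalDirs = fs.globStatus(new Path(output, "clusters-*-final"));
            if (finalDirs == null) {
              return 0; // no final directory produced yet
            }
            for (FileStatus dir : finalDirs) {
              for (FileStatus part : fs.listStatus(dir.getPath())) {
                if (part.getPath().getName().startsWith("part")) {
                  SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
                  try {
                    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                    while (reader.next(key, value)) {
                      count++; // each record is one cluster
                    }
                  } finally {
                    reader.close();
                  }
                }
              }
            }
            return count;
          }
        }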

        I'm not hung up at all on the style issues I suggested. I think you are making the most important improvements and that should suffice.

        Looking forward to seeing the mapreduce version.

        Paritosh Ranjan added a comment -

        I am doing Canopy clustering and I only get a clusters-0 directory. There is no clusters-n-final. Is this the expected behavior, or am I doing something wrong?

        Jeff Eastman added a comment -

        You are correct. Canopy writes to clusters-0 only. The iterative algorithms rename their last clusters-n directory to clusters-n-final to make accessing the final clusters simpler. Canopy could easily be adjusted to write to clusters-0-final if that makes things easier.

        Paritosh Ranjan added a comment -

        If it can be done easily, then I think we should do it. Can you do that? It would make the ClusterCountReader code consistent. The MapReduce version is working properly and is splitting clusters into their respective part files. If you can commit the Canopy change, I will be able to create the patch.

        I was also thinking of not moving these part files into their respective directories. I think it can be avoided, since dropping this processing step will improve performance, and the files are already separated, so the bottom-level clustering will be able to use them anyway. What do you suggest?

        Paritosh Ranjan added a comment -

        Implemented the MapReduce version of the post processor.

        Improved the readability of the code as per the suggestions.

        Added an output option and a sequential/mapreduce execution option.

        Keeping the SequenceFile.Writers in a map and closing them at the end, which has improved performance. This has also fixed the problem mentioned earlier that "the clusterData is overriding points". The test case of ClusterOutputPostProcessor demonstrates top down clustering.

        Reading clusters from the clusters-*-final directory, so Canopy clustering would need the fix discussed above.

        Paritosh Ranjan added a comment -

        Added a test case for ClusterCountReader and improved the assertions of the top down clustering test.

        Tested the cluster output post processor in both sequential and MapReduce modes using the K-Means algorithm. Works fine.

        I have not moved the part files into specific directories yet; I am waiting for your suggestion on that. To me, keeping the clusters in the part files themselves looks fine, as it saves a processing step and improves performance.

        Paritosh Ranjan added a comment - edited

        I am able to move the part files into their respective directories without any performance penalty by using FileSystem.rename(Path, Path) to move them, as the sketch below shows.
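
        The move is roughly this (a sketch; the directory layout and method names are illustrative, not the patch's exact paths):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public final class PartFileMover {

          private PartFileMover() {
          }

          /** Moves one part file into its cluster's directory without copying any data. */
          public static void moveToClusterDirectory(Configuration conf, Path partFile, Path clusterDir)
              throws IOException {
            FileSystem fs = FileSystem.get(conf);
            if (!fs.exists(clusterDir)) {
              fs.mkdirs(clusterDir);
            }
            // On HDFS, rename() only rewrites metadata, so no bytes are copied.
            if (!fs.rename(partFile, new Path(clusterDir, partFile.getName()))) {
              throw new IOException("Could not move " + partFile + " to " + clusterDir);
            }
          }
        }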

        Writing canopy clusters to clusters-0-final.

        The sequential and mapreduce versions are giving the same results, in the same format.

        I think it's done.

        Paritosh Ranjan added a comment -

        Hi Jeff,

        Did you get a chance to look over the patch?

        Paritosh Ranjan added a comment -

        The How to Contribute page talks about gentle reminders a few days after submitting a patch, so this is one of them.

        Paritosh Ranjan added a comment -

        Another friendly reminder, as suggested on the How to Contribute page.

        I think Jeff is a bit busy nowadays.

        The latest patch incorporates all of Jeff's suggestions, works as desired, and is ready to be committed. It is also simple to review if you just read the discussion.

        So anyone familiar with clustering can review it. Or I'll keep waiting... it has been quite some time.

        Paritosh Ranjan added a comment -

        Another reminder to review this patch.

        Jeff Eastman added a comment -

        Sorry, I was moving to Denver. I will review this patch next.

        Jeff Eastman added a comment -

        The patch installs cleanly and its unit tests run, but a full mvn install fails the k-means unit test, likely because of the clusters-0-final change in Canopy. There are some JavaDocs in need of completion, and I don't see how the execution mode can ever be set in the driver. Otherwise it is looking pretty reasonable and is almost ready to commit. Could you add a wiki page about it too?

        Paritosh Ranjan added a comment -

        Hi Jeff,

        I have started adding content to the wiki page https://cwiki.apache.org/MAHOUT/top-down-clustering.html. Is that location fine?

        The execution mode you are talking about is the sequential/mapreduce switch, correct? If so, there is a parameter for that; the sequential or the mapreduce version is executed based on it.

        Do I need to fix the KMeans tests and the javadocs? If so, would you like to point out some methods where the javadocs are not good, or should I just take a second look at them and try to improve them? Could you also provide the names of the classes where the tests are failing (I skip tests when building Mahout, since I use Windows)?

        Jeff Eastman added a comment -

        The wiki page looks like a good start. If you could include an example using the post processor, it would help others understand how to use it.

        The driver run() method adds -i and -o options but does not add the -xm option

        addOption(DefaultOptionCreator.methodOption().create());

        thus the subsequent getOption(DefaultOptionCreator.METHOD_OPTION) will always return null and sequential execution cannot be enabled.
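
        A skeleton of the wiring I mean, following the usual AbstractJob/DefaultOptionCreator conventions (the class name and the postProcess() placeholder are hypothetical, not the actual driver):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.mahout.common.AbstractJob;
        import org.apache.mahout.common.commandline.DefaultOptionCreator;

        public class PostProcessorDriverSketch extends AbstractJob {

          @Override
          public int run(String[] args) throws Exception {
            addInputOption();
            addOutputOption();
            // Without this line, getOption(METHOD_OPTION) below always returns null
            // and the sequential branch can never be taken.
            addOption(DefaultOptionCreator.methodOption().create());

            if (parseArguments(args) == null) {
              return -1;
            }

            Path input = getInputPath();
            Path output = getOutputPath();
            boolean runSequential = DefaultOptionCreator.SEQUENTIAL_METHOD
                .equalsIgnoreCase(getOption(DefaultOptionCreator.METHOD_OPTION));

            postProcess(input, output, runSequential);
            return 0;
          }

          /** Hypothetical placeholder for the sequential vs. mapreduce dispatch. */
          private static void postProcess(Path input, Path output, boolean sequential) {
          }

          public static void main(String[] args) throws Exception {
            ToolRunner.run(new PostProcessorDriverSketch(), args);
          }
        }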

        I'd like to see all the public methods, at least, have useful JavaDocs. When I look at the ClusterOutputPostProcessorDriver, there are several methods with incomplete JavaDoc comments. The 3 TODO indicators are a place to start, but many of the other methods in the patch do not include descriptions of the method arguments. In Eclipse, and I guess in IntelliJ, you can have the IDE flag these situations.

        The k-means unit test which is failing is testKMeansWithCanopyClusterId, which fails because it is looking for "clusters-0" and not "clusters-0-final". There are also other unit tests which fail for the same reason. I can fix them all before I commit the patch, but have you considered doing your development in a Linux VM on your Windows box? It is good practice to always run a full clean build before committing, since little changes like "-final" have a way of breaking lots of other code.

        Paritosh Ranjan added a comment -

        Added the -xm option.

        Revisited all the javadocs. All public methods have proper javadocs now; private methods have javadocs where needed.

        Revisited the tests failing due to the clusters-0-final change and fixed all I could find. Please fix any others before committing, if you find more.

        Thanks for the help with the addOption call for -xm. I was not able to figure that out.

        Paritosh Ranjan added a comment -

        Also completed the wiki page, https://cwiki.apache.org/MAHOUT/top-down-clustering.html.

        Jeff Eastman added a comment -

        The patch looks ready to commit, but I am currently blocked from running a full build because an unrelated test (TestDistributedRowMatrix.testTranspose) is failing. I want to make sure that all the tests run before I commit.

        On the wiki, I'd like to see a description and an example using the CLI interface too. Also, top-down clustering is not limited to top- and bottom-level clusterings; you might have an arbitrary number of middle levels in general. Finally, could you look at the organization and formatting of the other clustering algorithms' pages and adopt similar conventions?

        Paritosh Ranjan added a comment -

        I will do that.

        Hudson added a comment -

        Integrated in Mahout-Quality #1236 (See https://builds.apache.org/job/Mahout-Quality/1236/)
        MAHOUT-843: Final patch plus some integration fixes. All tests run

        jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211715
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/PathDirectory.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/TopDownClusteringPathConstants.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterCountReader.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessor.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessorDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessorMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessorReducer.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/topdown
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/topdown/PathDirectoryTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/topdown/postprocessor
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterCountReaderTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessorTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterEvaluator.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/cdbw/TestCDbwEvaluator.java
        • /mahout/trunk/src/conf/clusterpp.props
        • /mahout/trunk/src/conf/driver.classes.props
        Paritosh Ranjan added a comment -

        I took a code update. In clusterpp.props, the content is written twice. To me it looks like an error while committing; if not, please ignore.

        # The following parameters must be specified
        #i|input = /path/to/initial/cluster/output
        #o|output = /path/to/output
        # The following parameters must be specified
        #i|input = /path/to/initial/cluster/output
        #o|output = /path/to/output
        Jeff Eastman added a comment -

        I fixed the duplication.

        On the wiki page, it states that "So, all clustering algorithms available in Mahout, other than the MinHash Clustering algorithm ( which is a "Bottom Up" Clustering Algorithm ), are suitable...". I don't believe this is true. It seems to me that any clustering algorithm should be usable in either/any step: any clustering algorithm which operates upon Vectors and produces WeightedVectorWritables can be used. Isn't that the real criterion?
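
        For instance, any consumer of the clustered output only needs to iterate records shaped like this (a sketch assuming the cluster id key is an IntWritable, as the clustering drivers write to clusteredPoints; the path argument is illustrative):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.mahout.clustering.WeightedVectorWritable;

        public final class ClusteredPointsDump {

          private ClusteredPointsDump() {
          }

          /** Prints each (clusterId, weighted vector) pair from one clusteredPoints part file. */
          public static void dump(Configuration conf, Path partFile) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, partFile, conf);
            try {
              IntWritable clusterId = new IntWritable();
              WeightedVectorWritable point = new WeightedVectorWritable();
              while (reader.next(clusterId, point)) {
                System.out.println(clusterId.get() + "\t" + point.getVector());
              }
            } finally {
              reader.close();
            }
          }
        }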

        Finally, it would be nice to see a CLI example.

        Paritosh Ranjan added a comment -

        Ah, it's a mistake. I meant to write MeanShift clustering, which is actually a bottom-up clustering. But yes, it can also be used, so leaving the choice of clustering algorithm to the user would be the best thing. I will update the wiki.

        Since I have not used the CLI yet, I will have to learn how to use it before updating the wiki. It might take a few days, but I will update it for sure.

        A few questions:

        Is there any need to work on the Java API for top down clustering as I proposed earlier (with ClusterConfigs and ClusterExecutors)? I don't see much use for that now, as the user can pretty easily do top-middle-middle-...-middle-bottom clustering. I hope you will agree.

        Secondly, what other things are in the "to be implemented" bucket for clustering? I have some time on my hands and would like to contribute more to Mahout, so it would be really helpful if you could guide me towards other things that need to be done. I wouldn't mind looking into something new, even if it involves some reading/analysis/homework.

        Hudson added a comment -

        Integrated in Mahout-Quality #1242 (See https://builds.apache.org/job/Mahout-Quality/1242/)
        MAHOUT-843: removing redundant lines

        jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212533
        Files :

        • /mahout/trunk/src/conf/clusterpp.props
        Jeff Eastman added a comment -

        The CLI is an important interface, since it enables scripting approaches to using Mahout. As Windows users don't do much command-line execution, I can see that might take some study; Cygwin has a pretty decent *UX capability and should prove usable. I will work on developing a CLI example using the post processor too, so we can compare notes.

        I agree with your assessment that your initial ClusterConfigs and ClusterExecutors code is now unnecessary. I think it was a useful experiment which resulted in your factoring out the cluster output post processor, and the result is modular and clean.

        In terms of our "to be implemented" clustering code, I have an idea: the outlier pruning in MAHOUT-825 could be factored out of Canopy into another post processor instead of extending all of the other clustering algorithms with this capability. This should be low-hanging fruit for you after your top-down post processor work and would be a place where more sophisticated outlier rejection algorithms could be embedded later.

        We are targeting a 0.6 code freeze at the end of December, and any testing of the clustering code you can do in the interim would be beneficial. Heading into 0.7, I want to return to the classification/clustering convergence, which has not gotten many of my cycles in quite a while. Take a look at ClusterClassifier, ClusteringPolicy and ClusterIterator. They use a pluggable framework to converge all of the iterative clustering algorithms (those which process all the input vectors in each iteration and which write their state in clusters-n) with the classification APIs. Given your work with ClusterConfigs and ClusterExecutors you might find this interesting.

        I'm going to close this issue now and will look forward to continuing our conversations on other issues.

        Jeff Eastman added a comment -

        Closing this after successful development.


          People

          • Assignee:
            Jeff Eastman
          • Reporter:
            Paritosh Ranjan
          • Votes:
            0
          • Watchers:
            0
