Hi Tony. Nice work on the patch, but before we commit this there are a couple of things you need to cover. I still have to read the algorithm in detail to understand what's happening, but I have some queries and suggestions below that should serve as a checklist for making this a committable patch.
1) I am not a fan of text-based input, though it is what most of the algorithms in Mahout were first implemented with. Splitting and joining text files on commas is not very clean. Can you convert this to deal with a SequenceFile of VectorWritable, or some other Writable format? What's your input schema? See the sketch below for the kind of conversion I mean.
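A minimal sketch of what writing vectors to a SequenceFile could look like, assuming one dense vector per input record; the class name, key scheme, and output path are all hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class VectorWriterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("vectors/part-00000"); // example output path
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
        Text.class, VectorWritable.class);
    try {
      // these values would come from your parsed input record, not a literal
      double[] values = {1.0, 2.0, 3.0};
      writer.append(new Text("record-0"), new VectorWritable(new DenseVector(values)));
    } finally {
      writer.close();
    }
  }
}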
2) There is a code style we enforce in Mahout. You can run mvn checkstyle:checkstyle to see the violations. We also have an Eclipse formatter that produces code almost matching the checkstyle rules (rare manual interventions are required). Take a look at https://cwiki.apache.org/MAHOUT/howtocontribute.html; you will find the Eclipse formatter file at the bottom.
3) For parsing args, use the Apache Commons CLI2 library. Take a look at o/a/m/clustering/kmeans/KMeansDriver to see usage; the sketch below shows the general builder pattern.
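Something along these lines; the driver class name and the specific options are placeholders, not your actual parameters:

import org.apache.commons.cli2.CommandLine;
import org.apache.commons.cli2.Group;
import org.apache.commons.cli2.Option;
import org.apache.commons.cli2.OptionException;
import org.apache.commons.cli2.builder.ArgumentBuilder;
import org.apache.commons.cli2.builder.DefaultOptionBuilder;
import org.apache.commons.cli2.builder.GroupBuilder;
import org.apache.commons.cli2.commandline.Parser;

public final class MyJobDriver { // hypothetical driver name
  public static void main(String[] args) {
    DefaultOptionBuilder obuilder = new DefaultOptionBuilder();
    ArgumentBuilder abuilder = new ArgumentBuilder();
    GroupBuilder gbuilder = new GroupBuilder();

    Option inputOpt = obuilder.withLongName("input").withShortName("i").withRequired(true)
        .withArgument(abuilder.withName("input").withMinimum(1).withMaximum(1).create())
        .withDescription("Path to job input directory").create();
    Option outputOpt = obuilder.withLongName("output").withShortName("o").withRequired(true)
        .withArgument(abuilder.withName("output").withMinimum(1).withMaximum(1).create())
        .withDescription("Path to job output directory").create();

    Group group = gbuilder.withName("Options").withOption(inputOpt).withOption(outputOpt).create();

    try {
      Parser parser = new Parser();
      parser.setGroup(group);
      CommandLine cmdLine = parser.parse(args);
      String input = cmdLine.getValue(inputOpt).toString();
      String output = cmdLine.getValue(outputOpt).toString();
      // ... configure and run the job with input/output ...
    } catch (OptionException e) {
      // print usage/help on bad arguments
      System.err.println(e.getMessage());
    }
  }
}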
4) What is Utils being used for?
+ public void setup(Context context) throws IOException,InterruptedException
+ String filePath = context.getConfiguration().get("a");
+ sumAttribute = Utils.readFile(filePath+"/part-r-00000");
5) Please use the DistributedCache to read the file in a map/reduce context. See the DictionaryVectorizer Map/Reduce classes for usage; a sketch of the idea follows.
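Roughly like this, assuming a mapper shaped like the one in the patch; the mapper's input/output types and the single-line read are placeholders for whatever Utils.readFile actually does:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> { // hypothetical types
  private String sumAttribute;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // Files registered in the driver via
    //   DistributedCache.addCacheFile(new Path(sumDir, "part-r-00000").toUri(), conf);
    // show up here as node-local paths, so mappers don't all hit HDFS for them.
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
    if (cacheFiles != null && cacheFiles.length > 0) {
      BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
      try {
        sumAttribute = reader.readLine(); // stand-in for whatever Utils.readFile extracts
      } finally {
        reader.close();
      }
    }
  }
}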
6) job.setNumReduceTasks(1); ? Is this necessary? Doesn't it hurt the scalability of this algorithm? Is the single reducer going to get a lot of data from the mappers? If yes, then you should think about removing this constraint and either letting the Hadoop parameters control it or parameterizing it.
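For instance, something like this instead of the hard-coded 1 (the configuration key and default here are hypothetical):

// int numReducers = job.getConfiguration().getInt("my.job.num.reducers", 4);
// job.setNumReduceTasks(numReducers);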
7) Can this job be optimized using a Combiner? If yes, it's really worth spending the time to make one.
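This only pays off if the reduce function is associative and commutative (e.g. summing partial totals, which is an assumption about this patch). Assuming it is, a sketch with hypothetical key/value types:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  private final DoubleWritable result = new DoubleWritable();

  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    // collapse each key's partial values map-side before they hit the network
    double sum = 0.0;
    for (DoubleWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

// wired up in the driver:
// job.setCombinerClass(SumCombiner.class);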