I'm still reading the code to get a deeper understanding of what's happening, and I have some more questions:
1) The createClusterWriter method inside ClusterDumper creates one of three types of writers depending on the outputFormat. One of the arguments passed to these writers is the map in question, shown below:
private Map<Integer, List<WeightedVectorWritable>> clusterIdToPoints;
It's not clear to me whether we need a deeper refactoring to rewrite or replace these different writers with the ClusterOutputPostProcessor. Any thoughts on this? Should users have a choice between the writers and the ClusterOutputPostProcessor?
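For reference, this is the shape of data I understand that map to hold. It is only a minimal sketch in plain Java: WeightedPoint is a hypothetical stand-in for WeightedVectorWritable, and groupByCluster is an illustrative helper of my own, not a Mahout method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ClusterPointsSketch {

    // Hypothetical stand-in for Mahout's WeightedVectorWritable: a weight plus a vector.
    static final class WeightedPoint {
        final double weight;
        final double[] vector;

        WeightedPoint(double weight, double[] vector) {
            this.weight = weight;
            this.vector = vector;
        }
    }

    // Illustrative helper (not part of Mahout): groups each point under the
    // cluster ID it was assigned to, giving the cluster ID -> points shape.
    static Map<Integer, List<WeightedPoint>> groupByCluster(int[] clusterIds, List<WeightedPoint> points) {
        Map<Integer, List<WeightedPoint>> clusterIdToPoints = new TreeMap<>();
        for (int i = 0; i < points.size(); i++) {
            clusterIdToPoints
                .computeIfAbsent(clusterIds[i], id -> new ArrayList<>())
                .add(points.get(i));
        }
        return clusterIdToPoints;
    }

    public static void main(String[] args) {
        List<WeightedPoint> points = Arrays.asList(
            new WeightedPoint(1.0, new double[] {0.1, 0.2}),
            new WeightedPoint(1.0, new double[] {5.0, 5.1}),
            new WeightedPoint(1.0, new double[] {0.2, 0.1}));
        int[] clusterIds = {0, 1, 0};

        Map<Integer, List<WeightedPoint>> grouped = groupByCluster(clusterIds, points);
        for (Map.Entry<Integer, List<WeightedPoint>> entry : grouped.entrySet()) {
            System.out.println("cluster " + entry.getKey() + ": " + entry.getValue().size() + " point(s)");
        }
    }
}
```

If that reading is right, both the writers and the ClusterOutputPostProcessor would be consuming the same cluster-to-points grouping, which is why I am unsure how much overlap there is between them.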
2) For the following line of code:
long numWritten = clusterWriter.write(new SequenceFileDirValueIterable<ClusterWritable>(new Path(seqFileDir, "part-*"), PathType.GLOB, conf));
Does the above just use an iterator to dump the points to different directories corresponding to the different clusters? The code is really hard to read, and SequenceFileDirValueIterable is not well commented.
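To make the question concrete, here is how I currently read the iterable part of that line, sketched with plain java.nio rather than Hadoop. This is an analogy under my own assumptions, not Mahout's implementation: text files stand in for SequenceFiles, lines stand in for the ClusterWritable values, and readAllValues and demo are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class GlobValueIterationSketch {

    // Collects the values (here: lines) from every file in dir whose name matches the glob.
    static List<String> readAllValues(Path dir, String glob) {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path part : stream) {
                parts.add(part);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(parts); // directory listing order is unspecified, so sort for a stable result

        List<String> values = new ArrayList<>();
        for (Path part : parts) {
            try {
                values.addAll(Files.readAllLines(part));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return values;
    }

    // Builds a small temporary directory and reads it back through the glob.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("clusters");
            Files.write(dir.resolve("part-00000"), Arrays.asList("cluster-0", "cluster-1"));
            Files.write(dir.resolve("part-00001"), Arrays.asList("cluster-2"));
            Files.write(dir.resolve("_SUCCESS"), Arrays.asList("ignored")); // does not match part-*
            return readAllValues(dir, "part-*");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this reading the iterable only gathers values from the matching part files, and whatever happens per cluster would be decided inside clusterWriter.write. Please correct me if that is wrong.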
Thanks for your help in getting a better understanding of this.