I'm running into a challenge integrating Fuzzy KMeans (and Dirichlet) into this evaluator. Currently the clustering step of fuzzyK emits the vector as the key and a FuzzyKMeansOutput writable as the value of the sequence file. This is backwards from the [clusterId :: VectorWritable] encoding that the patch uses for Canopy and KMeans. Also, the Fuzzy...Output bean contains all of the clusters and the probability that the vector is a member of each, which makes it rather large to be a key.
For CDbw to find the reference points it really needs to iterate over [clusterId :: VectorWritable] pairs, and this raises the question of what to do with fuzzy membership. I don't know if CDbw can be adjusted to handle fuzziness in general, but it will probably work with some points assigned to more than one cluster. Does it make sense to apply a settable threshold to the clustering step so that every point whose membership probability for a cluster is > threshold would be assigned to that cluster?
This would also work for Dirichlet. To implement it in fuzzyK I would need to change the FuzzyKMeansClusterer and FuzzyKMeansClusterMapper to emit the same [clusterId :: VectorWritable] pairs as the other clustering jobs.
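To make the threshold idea concrete, here is a rough sketch in plain Java. It is not Mahout code: the class ThresholdAssignment and the method clustersAbove are hypothetical names, the membership probabilities are a bare double[] indexed by clusterId, and the values in main are made up for illustration. A real mapper would emit one [clusterId :: VectorWritable] pair per returned id.

```java
import java.util.ArrayList;
import java.util.List;

public class ThresholdAssignment {

    // Hypothetical helper (not part of Mahout): returns the ids of every cluster
    // whose membership probability for one point is strictly above the threshold.
    // Assumption: membership[i] is the probability that the point belongs to cluster i.
    static List<Integer> clustersAbove(double[] membership, double threshold) {
        List<Integer> ids = new ArrayList<>();
        for (int clusterId = 0; clusterId < membership.length; clusterId++) {
            if (membership[clusterId] > threshold) {
                ids.add(clusterId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up membership probabilities for a single point over three clusters.
        double[] membership = {0.55, 0.40, 0.05};
        // Stand-in for emitting one [clusterId :: vector] pair per qualifying cluster.
        for (int clusterId : clustersAbove(membership, 0.3)) {
            System.out.println(clusterId + " :: point");
        }
    }
}
```

One consequence of this rule is that a point with no membership above the threshold is not assigned to any cluster, so we would have to decide whether to drop such points or fall back to the most likely cluster.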
Does this make sense?