I am a bit confused.
Are we planning to get rid of the way clustering is currently done, which is algorithm-specific (i.e. the code in CanopyClusterer)?
Will the new clustering strategy be only what is implemented in ClusterClassifier, i.e. calculating the probability of each vector belonging to each model (cluster) and choosing the model with the highest probability?
If yes, then implementing a clustering policy for each clustering algorithm is all that is needed. For outlier removal, just a threshold probability would be needed: any vector whose highest probability falls below that threshold won't be clustered. Am I correct?
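To make sure I am describing the same thing, here is a minimal sketch of the classify-then-threshold idea. It is plain Python rather than Mahout code, and every name in it (`score`, `classify`, `assign`, `threshold`) is hypothetical. The distance-based score is only a stand-in for whatever pdf each model provides, and applying the threshold to the normalized probabilities is my assumption.

```python
import math


def score(vector, center, radius=1.0):
    """Unnormalized membership score (stand-in for a model's pdf)."""
    d2 = sum((v - c) ** 2 for v, c in zip(vector, center))
    return math.exp(-d2 / (2.0 * radius ** 2))


def classify(vector, centers):
    """Return one probability per model, normalized to sum to 1."""
    scores = [score(vector, c) for c in centers]
    total = sum(scores)
    if total == 0.0:
        # Every score underflowed: the vector is far from all models.
        return [0.0] * len(centers)
    return [s / total for s in scores]


def assign(vector, centers, threshold):
    """Index of the most probable model, or None if treated as an outlier."""
    probabilities = classify(vector, centers)
    best = max(range(len(centers)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return None  # below the threshold: leave the vector unclustered
    return best


centers = [(0.0, 0.0), (10.0, 10.0)]
print(assign((0.5, 0.5), centers, 0.6))  # close to the first center
print(assign((5.0, 5.0), centers, 0.6))  # equidistant from both centers
```

In this sketch the only algorithm-specific part is `score`; the assignment and outlier logic are shared, which is how I understand the policy idea.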
Until now, I have been assuming that the clustering code just needs to be refactored out (without changing the implementation). If that is the case, then I think I have been proceeding in the right direction in terms of design.
However, I suspect we are not in sync on the implementation approach. I think you want to change the clustering implementation to a cluster classification implementation with outlier removal (and completely get rid of the algorithm-specific implementation, which makes sense).
So it would be really helpful if you could clarify these points.