Commons Math / MATH-1031

Refactoring: Move variance calculation of a centroid cluster to its class

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2
    • Fix Version/s: 3.3
    • Labels:
      None

      Description

      Users may want to assess the quality of each cluster in a computed clustering. One way to do this is to calculate the cluster's variance.
      The variance calculation is already performed in other places (e.g. in MultiKMeans), but it is not available to end users.
      I propose adding this functionality to CentroidCluster. One issue to consider is that the cluster does not know which distance measure was used to compute it. In my implementation, the method takes a distance measure as a parameter, which also lets users compare cluster quality under different distance measures. Alternatively, the distance measure could be stored as a field that the clustering algorithm sets.
      In the patch I chose the first approach and also changed the two other places where the variance is calculated to use the new method.
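The first approach described above can be sketched roughly as follows. This is not the attached patch: `Distance`, `Distances` and `SimpleCentroidCluster` are hypothetical stand-ins written so the example has no dependencies, and the sketch assumes that "variance" means the sample variance of the point-to-center distances.

```java
import java.util.List;

/** Hypothetical distance abstraction used only for this sketch. */
interface Distance {
    double compute(double[] a, double[] b);
}

/** Two common measures, written out so the sketch needs no library. */
final class Distances {
    static final Distance EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    static final Distance MANHATTAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    };

    private Distances() { }
}

/** Hypothetical, simplified centroid cluster: a center plus its member points. */
final class SimpleCentroidCluster {
    private final double[] center;
    private final List<double[]> points;

    SimpleCentroidCluster(double[] center, List<double[]> points) {
        this.center = center;
        this.points = points;
    }

    /**
     * Sample variance (n - 1 denominator) of the point-to-center distances
     * under the supplied measure. Returns 0 for fewer than two points.
     */
    double variance(Distance measure) {
        int n = points.size();
        if (n < 2) {
            return 0.0;
        }
        // First pass: mean distance to the center.
        double mean = 0.0;
        for (double[] p : points) {
            mean += measure.compute(p, center);
        }
        mean /= n;
        // Second pass: sum of squared deviations from that mean.
        double sumSq = 0.0;
        for (double[] p : points) {
            double dev = measure.compute(p, center) - mean;
            sumSq += dev * dev;
        }
        return sumSq / (n - 1);
    }
}
```

Because the measure is an argument rather than a field, the same cluster can be scored with `Distances.EUCLIDEAN` and `Distances.MANHATTAN` and the results compared, which is the flexibility the description mentions.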

      1. centroid.patch
        3 kB
        Thorsten Schäfer

        Activity

        Thorsten Schäfer added a comment -

        Added patch with the new method and refactored the other classes that use the variance calculation.

        Thomas Neidhart added a comment -

        I thought about this myself, but I was thinking about a different solution.
        The MultiKMeans algorithm, for example, uses the variance method to evaluate how "good" a clustering is, but this should be made more flexible. I would therefore propose creating a new interface, e.g. "ClusterEvaluation", that performs this kind of calculation and can be used in different places. An implementation can then be passed as an argument to the clustering algorithm and later be used to evaluate the results.
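A minimal sketch of what such an evaluation hook might look like, assuming a lower score means a better clustering. Only the name `ClusterEvaluation` comes from the comment above. The `score` signature, `PointCluster`, `SumOfVariances` and `BestOfRuns` are hypothetical and are not the API that was eventually committed.

```java
import java.util.List;

/** Hypothetical cluster: a center and the points assigned to it. */
final class PointCluster {
    final double[] center;
    final List<double[]> points;

    PointCluster(double[] center, List<double[]> points) {
        this.center = center;
        this.points = points;
    }
}

/** Sketch of the proposed hook; in this sketch lower scores are better. */
interface ClusterEvaluation {
    double score(List<PointCluster> clusters);
}

/** One possible evaluation: sum of per-cluster variances of Euclidean distances. */
final class SumOfVariances implements ClusterEvaluation {
    @Override
    public double score(List<PointCluster> clusters) {
        double total = 0.0;
        for (PointCluster c : clusters) {
            total += variance(c);
        }
        return total;
    }

    /** Sample variance of point-to-center distances; 0 for fewer than two points. */
    private static double variance(PointCluster c) {
        int n = c.points.size();
        if (n < 2) {
            return 0.0;
        }
        double[] dist = new double[n];
        double mean = 0.0;
        for (int i = 0; i < n; i++) {
            dist[i] = euclidean(c.points.get(i), c.center);
            mean += dist[i];
        }
        mean /= n;
        double sumSq = 0.0;
        for (double d : dist) {
            sumSq += (d - mean) * (d - mean);
        }
        return sumSq / (n - 1);
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }
}

/** Picks the best of several clustering runs, as a multi-run wrapper might. */
final class BestOfRuns {
    static List<PointCluster> select(List<List<PointCluster>> runs,
                                     ClusterEvaluation evaluation) {
        List<PointCluster> best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (List<PointCluster> run : runs) {
            double s = evaluation.score(run);
            if (s < bestScore) {
                bestScore = s;
                best = run;
            }
        }
        return best; // null when no run was supplied
    }
}
```

Passing the evaluation in as an argument means the selection loop does not need to change when a different quality criterion is wanted.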

        Thorsten Schäfer added a comment -

        Yes, if there is a need for additional flexibility, your solution seems better. The ClusterEvaluation could also be used in a divisive hierarchical clustering algorithm to choose which cluster to split next.
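As a rough illustration of that idea, a divisive algorithm could score every current cluster and split the one with the highest (worst) score. The `SplitSelector` helper below is hypothetical and generic over the cluster type. It assumes the scoring function returns larger values for clusters that are more in need of splitting.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: picks the cluster a divisive algorithm would split next. */
final class SplitSelector {
    /**
     * Returns the cluster with the highest score, or null for an empty list.
     * Ties keep the earliest cluster in the list.
     */
    static <C> C nextToSplit(List<C> clusters, ToDoubleFunction<C> score) {
        C worst = null;
        double worstScore = Double.NEGATIVE_INFINITY;
        for (C c : clusters) {
            double s = score.applyAsDouble(c);
            if (s > worstScore) {
                worstScore = s;
                worst = c;
            }
        }
        return worst;
    }

    private SplitSelector() { }
}
```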

        Thomas Neidhart added a comment -

        Yes indeed, I plan to commit this change soon.

        By the way, there is also MATH-959, which adds a hierarchical clusterer to CM, in case you are interested.
        I have already added a preliminary patch that contains an optimal algorithm for single-link.

        The remaining link methods are still to be implemented, or will be implemented with a naive algorithm, which is straightforward.

        Thomas Neidhart added a comment -

        Applied changes in r1542545.

        It would be good to add further cluster evaluation methods, as described in
        http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results

        Luc Maisonobe added a comment -

        Closing all resolved issues now available in the released 3.3 version.


          People

          • Assignee: Unassigned
          • Reporter: Thorsten Schäfer