Mahout / MAHOUT-899

Add Point Sampling, Color coding to ClusterDumper

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: Clustering, Integration
    • Labels:
      None

      Description

      When running the cluster dumper, or writing values to a file for display purposes, it is useful not to have to deal with every point in each cluster. This issue adds the ability to specify a maximum number of points to output per cluster in the cluster dumper.

      1. MAHOUT-899.patch
        19 kB
        Grant Ingersoll
      2. MAHOUT-899.patch
        9 kB
        Grant Ingersoll

        Activity

        Grant Ingersoll added a comment -

        Adds point sampling, and also adds coloring to the output of the GraphML writer.

        Lance Norskog added a comment -

        Suggestion on sampling: use reservoir sampling, with a fixed seed, instead of "first N".
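
A minimal sketch of what the reservoir-sampling suggestion could look like, assuming the points of a cluster arrive as a plain `Iterator`. The class `ReservoirSampler` and its `sample` method are hypothetical names for illustration and are not taken from the MAHOUT-899 patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper for illustration; not part of the MAHOUT-899 patch. */
final class ReservoirSampler {
  private ReservoirSampler() {}

  /**
   * Returns at most maxPoints items chosen uniformly at random from the
   * stream in a single pass (Algorithm R).
   */
  static <T> List<T> sample(Iterator<T> points, int maxPoints, long seed) {
    if (maxPoints < 0) {
      throw new IllegalArgumentException("maxPoints must be >= 0");
    }
    Random random = new Random(seed); // fixed seed makes the sample repeatable
    List<T> reservoir = new ArrayList<>(maxPoints);
    long seen = 0;
    while (points.hasNext()) {
      T point = points.next();
      seen++;
      if (reservoir.size() < maxPoints) {
        // Fill phase: keep the first maxPoints items unconditionally.
        reservoir.add(point);
      } else {
        // Pick a slot in [0, seen); replace only if it lands in the reservoir.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < maxPoints) {
          reservoir.set((int) slot, point);
        }
      }
    }
    return reservoir;
  }
}
```

Unlike "first N", every point in the cluster has the same chance of being kept, and the fixed seed means two runs over the same input in the same order produce the same sample.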

        Other ideas:

        • make edges directed based on distance from centroid. Given (n1, n2, e) and n1 is farther from the centroid than n2, make the edge point from n1 -> n2. This creates a cool partial ordering.
        • add node/edge weights based on whatever is appropriate in the algorithm.
        • add node weight based on distance.
        • if vectors are NamedVector add the name as a node attribute.
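
The edge-direction idea above can be sketched as follows. This assumes points and centroids are plain `double[]` arrays and that distance means Euclidean distance; the class `EdgeDirection` and its methods are hypothetical and are not part of Mahout or of the patch.

```java
/** Hypothetical helper for illustration; not part of Mahout. */
final class EdgeDirection {
  private EdgeDirection() {}

  /** Euclidean distance between two vectors of equal length. */
  static double distance(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  /**
   * Returns {source, target} node ids for the edge between n1 and n2.
   * The source is the endpoint farther from the centroid, so the edge
   * points toward the centroid. Ties keep the given order.
   */
  static int[] direct(int n1, double[] p1, int n2, double[] p2, double[] centroid) {
    double d1 = distance(p1, centroid);
    double d2 = distance(p2, centroid);
    return d1 >= d2 ? new int[] {n1, n2} : new int[] {n2, n1};
  }
}
```

The same `distance` value could also serve as the distance-based node weight mentioned in the list.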

        I find Gephi baffling, but I'm sure an expert would find these cool additions.

        Grant Ingersoll added a comment -

        The last one is implemented already; you just need to turn on the labels in Gephi. I'll look into the weights thing.

        Grant Ingersoll added a comment -

        Here's a lame attempt at laying out the clusters in 2D in the GraphMLCluster by simply placing the points of a cluster around its centroid. Basically, it shows the relationship between a point and its centroid, but should not be misconstrued to show anything else.
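
For readers who want a picture of such a layout, here is a minimal sketch that spaces the points of one cluster evenly on a circle around the centroid's 2D position. The even spacing, the fixed radius, and the class `CircleLayout` are assumptions made for illustration; the patch may place points differently.

```java
/** Hypothetical helper for illustration; the actual patch may differ. */
final class CircleLayout {
  private CircleLayout() {}

  /**
   * Returns numPoints positions, each as {x, y}, evenly spaced on a circle
   * of the given radius centered on (cx, cy).
   */
  static double[][] around(double cx, double cy, double radius, int numPoints) {
    double[][] positions = new double[numPoints][2];
    for (int i = 0; i < numPoints; i++) {
      // Angle of the i-th point, in radians, starting at the positive x axis.
      double angle = 2 * Math.PI * i / numPoints;
      positions[i][0] = cx + radius * Math.cos(angle);
      positions[i][1] = cy + radius * Math.sin(angle);
    }
    return positions;
  }
}
```

A layout like this only conveys cluster membership: the position of a point on the circle carries no information about its original n-dimensional coordinates.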

        I'm looking for better layout mechanisms. Basically, need to be able to project an n-dimensional vector down to 2-D. Ideas most welcome. At a minimum, I like the refactoring done here.

        Grant Ingersoll added a comment -

        I'm going to commit what I have for now for this release, and then we can iterate.

        Hudson added a comment -

        Integrated in Mahout-Quality #1296 (See https://builds.apache.org/job/Mahout-Quality/1296/)
        MAHOUT-899: Add some more cluster dumping options like sampling, coloring

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1228948
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/StringUtils.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/common/StringUtilsTest.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/clustering/AbstractClusterWriter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/clustering/CSVClusterWriter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/clustering/ClusterDumperWriter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/clustering/GraphMLClusterWriter.java

          People

          • Assignee: Grant Ingersoll
          • Reporter: Grant Ingersoll
          • Votes: 0
          • Watchers: 0
