Mahout / MAHOUT-778

Mark folder name of final clustering iteration with pattern such as 'cluster-n-last'

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels: None

      Description

      It would be useful if KMeans and FuzzyKMeans marked the folder of the last clustering iteration with a pattern such as 'cluster-n-last'.

      At the moment it is difficult to configure other programs to process clustering results since the number of actual iterations is not known up front.

      A PathFilter similar to ClustersFilter could be created that filters folders on the pattern 'cluster-*-last' in order to determine the folder.

      1. MAHOUT-778.patch (14 kB, Frank Scholten)
      2. MAHOUT-778.patch (16 kB, Sean Owen)
      3. MAHOUT-778.patch (5 kB, Sean Owen)
      4. MAHOUT-778-ClustersFilter.patch (5 kB, Frank Scholten)

        Activity

        Sean Owen added a comment -

        How does this look? It adds an empty file called "cluster-N-last", where "cluster-N" is the last directory created and so has the result. I was reluctant to move or rename where the data ends up so as to break less stuff; this is a smaller change.
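
For readers consuming the output, the marker-file idea can be sketched as follows. This is a minimal illustration, not the code in the patch: the class `LastMarker` and the method `finishedDir` are hypothetical names, and the output directory is modeled as a plain collection of entry names rather than a Hadoop file system listing.

```java
import java.util.Collection;
import java.util.Optional;

/** Hypothetical helper (not part of Mahout): resolves the final iteration directory from a marker entry. */
final class LastMarker {
  // Suffix of the empty marker file described in the comment above.
  static final String MARKER_SUFFIX = "-last";

  private LastMarker() {}

  /**
   * Returns the directory name X for which both "X" and the marker "X-last"
   * are present in the listing, or empty if no marker has been written yet.
   */
  static Optional<String> finishedDir(Collection<String> names) {
    for (String name : names) {
      if (name.endsWith(MARKER_SUFFIX)) {
        String dir = name.substring(0, name.length() - MARKER_SUFFIX.length());
        if (names.contains(dir)) {
          return Optional.of(dir);
        }
      }
    }
    return Optional.empty();
  }
}
```

An empty result means the job has not written its marker yet, so a consumer can treat it as "not done" and wait.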

        Jeff Eastman added a comment -

        +1 Looks like a minimalist solution to a common problem. The alternative is to iterate over the clusters-n directories looking for the highest n. This makes it a simple deterministic computation.
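
The alternative mentioned here, scanning for the highest n, might look roughly like the sketch below. It is an assumption-laden illustration: `LastIterationFinder` and `lastIteration` are hypothetical names, and the input is a plain list of directory names instead of a Hadoop listing. The comparison is numeric, since a lexicographic sort would rank clusters-9 above clusters-10.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Mahout): picks the highest-numbered clusters-N name. */
final class LastIterationFinder {
  // Matches "clusters-N" exactly; the digit count is capped so parseInt cannot overflow.
  private static final Pattern CLUSTERS_DIR = Pattern.compile("clusters-(\\d{1,9})");

  private LastIterationFinder() {}

  /** Returns the name with the largest N, or empty if no name matches the pattern. */
  static Optional<String> lastIteration(List<String> dirNames) {
    String best = null;
    int bestN = -1;
    for (String name : dirNames) {
      Matcher m = CLUSTERS_DIR.matcher(name);
      if (m.matches()) {
        int n = Integer.parseInt(m.group(1));
        if (n > bestN) {
          bestN = n;
          best = name;
        }
      }
    }
    return Optional.ofNullable(best);
  }
}
```

Note that this scan only reports the highest directory present at the time of the call; unlike a marker, it cannot tell whether another iteration is still running.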

        Sean Owen added a comment -

        (I'm also happy to create a simple method that would go find the dir with the highest number. Would that be better? I am not sure if Frank was suggesting this was the extent of the pain to be solved, or whether the problem is more generally that you don't know at any given time whether there's another iteration coming, and to wait. In which case, yeah, we need a "done" marker like this. Frank, preferences?)

        Frank Scholten added a comment - - edited

        +1 for the done marker

        I modified ClustersFilter from examples (used by the Display* classes).

        You can pass a Configuration object and a cluster output Path to the constructor and it will accept the last iteration path if the done marker is present. Otherwise it accepts 'clusters-0'.

        Let me know what you think.

        There is still the problem that you need to pass in the exact last iteration path to ClusterDumper. If the last iteration path is renamed to 'clusters-n-done' you can use a glob for command line Mahout:

        --output=clusters-*-done
        

        but this would break existing things.

        Lance Norskog added a comment -

        Could the final iteration files be renamed at the end of the job? This convention would make other iterated algorithms easier to process.

        Sean Owen added a comment -

        It could... what to rename it to though? I'm concerned that would break any users that expect the current behavior. This at least leaves the behavior as-is.

        Jeff Eastman added a comment -

        There is an existing method, TestClusterDumper.finalClusterPath(), that could be moved to resolve this programmatically. Alternatively, appending "-final" or somesuch to the final clusters directory would work and has some appeal to me. I think the impact of the latter would be minimal and that users would welcome it. Better to do the right thing now than postpone, IMHO.

        Sean Owen added a comment -

        OK here's an omnibus patch, including Jeff's idea. It shows the extent of the caller change, by showing how unit tests have to change. Not a big change, but does break anyone expecting to find clusters-N. If Frank / Jeff / Robin think that's OK (and I think that sounds OK), I'll commit.

        Frank Scholten added a comment - - edited

        Fixed the ClustersFilter so it accepts paths starting with "clusters-" and ending with "-final".

        Extracted "-final" as a constant in Cluster and updated the build-reuters script so it uses a glob for the cluster output path.
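
The acceptance rule described in this comment can be illustrated with a small sketch. It is not the committed ClustersFilter: the class name `FinalClustersNameFilter`, its constants, and its methods are hypothetical, and it works on plain directory names rather than Hadoop Path objects.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical name-based filter (not the Mahout ClustersFilter): accepts "clusters-...-final" names. */
final class FinalClustersNameFilter {
  // Assumed constant names; the comment above only says "-final" was extracted as a constant in Cluster.
  static final String PREFIX = "clusters-";
  static final String FINAL_SUFFIX = "-final";

  private FinalClustersNameFilter() {}

  /** True if the name starts with "clusters-" and ends with "-final". */
  static boolean accept(String dirName) {
    return dirName.startsWith(PREFIX) && dirName.endsWith(FINAL_SUFFIX);
  }

  /** Keeps only the names that pass accept(), preserving input order. */
  static List<String> finalDirs(List<String> dirNames) {
    return dirNames.stream()
        .filter(FinalClustersNameFilter::accept)
        .collect(Collectors.toList());
  }
}
```

With a listing of clusters-0, clusters-1 and clusters-2-final, only clusters-2-final passes, which is the same directory a shell glob such as clusters-*-final would select.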

        Jeff Eastman added a comment -

        +1 looks reasonable to me

        Sean Owen added a comment -

        Committed Frank's latest, with additional test changes to make them pass

        Hudson added a comment -

        Integrated in Mahout-Quality #1073 (See https://builds.apache.org/job/Mahout-Quality/1073/)
        MAHOUT-778 label final output as "clusters-N-final"

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1177786
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/Cluster.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/ClustersFilter.java
        • /mahout/trunk/examples/src/test/java/org/apache/mahout/clustering
        • /mahout/trunk/examples/src/test/java/org/apache/mahout/clustering/display
        • /mahout/trunk/examples/src/test/java/org/apache/mahout/clustering/display/ClustersFilterTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterEvaluator.java

          People

          • Assignee: Sean Owen
          • Reporter: Frank Scholten