Mahout / MAHOUT-854

Add MinHash to build-reuters.sh example

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: Clustering, Examples
    • Labels:
      None

      Description

      The Reuters data set can be used for MinHash clustering, so it would be nice to add the MinHash algorithm to build-reuters.sh.

      1. MAHOUT-854.patch
        1 kB
        Varun Thacker

        Activity

        Jeff Eastman added a comment -

        Can this issue be closed?

        Hudson added a comment -

        Integrated in Mahout-Quality #1302 (See https://builds.apache.org/job/Mahout-Quality/1302/)
        MAHOUT-854: add in overwrite option for Minhash

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1230780
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/minhash/MinHashDriver.java
        • /mahout/trunk/examples/bin/cluster-reuters.sh
        Grant Ingersoll added a comment -

        I added in an overwrite option. Let's see if Jenkins is happy with that.

        Jeff Eastman added a comment -

        Reopening since this appears to be related to a current Jenkins build failure:

        Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/mahout-work-jenkins/reuters-minhash already exists
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:134)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:846)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
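
        The trace above shows Hadoop refusing to start a job because its output directory is left over from an earlier run. A minimal sketch of one way a script could guard against that when running against the local filesystem is shown here. The helper name clean_output_dir is hypothetical and is not taken from the committed patch; on a cluster the removal would have to go through the Hadoop filesystem shell instead of rm.

```shell
#!/bin/sh
# Hypothetical helper (not from the MAHOUT-854 patch): remove a stale local
# output directory so a re-run does not hit FileAlreadyExistsException.
clean_output_dir() {
  dir="$1"
  # Refuse empty or root paths so an unset variable cannot wipe the filesystem.
  case "$dir" in
    ""|"/")
      echo "refusing to remove '$dir'" >&2
      return 1
      ;;
  esac
  if [ -d "$dir" ]; then
    echo "Removing stale output directory: $dir"
    rm -rf -- "$dir"
  fi
}

# Example call (path copied from the trace above; adjust to your WORK_DIR):
# clean_output_dir "/tmp/mahout-work-jenkins/reuters-minhash"
```

        The guard against empty and root paths is a deliberate design choice: the script builds the directory name from a variable, and an unset variable should fail loudly rather than delete the wrong tree.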

        Grant Ingersoll added a comment -

        It's hooked in.

        Hudson added a comment -

        Integrated in Mahout-Quality #1133 (See https://builds.apache.org/job/Mahout-Quality/1133/)
        MAHOUT-854: add minhash example

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1196448
        Files :

        • /mahout/trunk/examples/bin/build-reuters.sh
        Varun Thacker added a comment -

        I will look into ClusterDump and keep everyone posted on the issue.

        Grant Ingersoll added a comment -

        I've committed this, but will leave the issue open for now, pending ClusterDump updates if Varun wishes to tackle it.

        Grant Ingersoll added a comment -

        > 1. Is it just me or when I try running the script using any of the clustering algorithms I get this error:

        Works for me.

        > 2. Regarding MinHash is the clusterdump part required? I

        Not necessarily, but the whole point is to show some output. I guess we'd have to modify clusterdump to handle this format, since it is different from the existing ones.

        Varun Thacker added a comment -

        I am not sure about two things:

        1. Is it just me or when I try running the script using any of the clustering algorithms I get this error:

        ./build-reuters.sh: line 165: 17319 Killed                  $MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir/ -o ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans 

        2. Regarding MinHash, is the clusterdump part required? If yes, can someone tell me what needs to be done to implement it for MinHash? I'm not too sure how to implement it in case it is needed.
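
        On the first question: a "Killed" line from the shell means the seq2sparse process received SIGKILL, which the shell reports as exit status 137 (128 + 9). The thread does not establish the cause; one common cause is the kernel's out-of-memory killer. As a rough sketch, a wrapper like the following could make such failures easier to spot in a script. The name run_step is hypothetical and does not appear in build-reuters.sh.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of build-reuters.sh): run one step and
# report when it was terminated by SIGKILL rather than failing on its own.
run_step() {
  "$@"
  status=$?
  if [ "$status" -eq 137 ]; then
    # 137 = 128 + 9, i.e. the process was terminated by SIGKILL.
    echo "step '$1' was killed (SIGKILL); available memory is worth checking" >&2
  elif [ "$status" -ne 0 ]; then
    echo "step '$1' failed with exit status $status" >&2
  fi
  return "$status"
}

# Example call, mirroring the failing line quoted above:
# run_step $MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir/ -o ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans
```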


          People

          • Assignee: Grant Ingersoll
          • Reporter: Varun Thacker
          • Votes: 0
          • Watchers: 0
