Mahout / MAHOUT-854

Add MinHash to build-reuters.sh example

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: Clustering, Examples
    • Labels: None

      Description

      We can use the Reuters data set for MinHash clustering, so it would be nice to add the MinHash algorithm to build-reuters.sh.

      1. MAHOUT-854.patch
        1 kB
        Varun Thacker

        Activity

        varunthacker Varun Thacker added a comment -

        I am not sure about two things:

        1. Is it just me or when I try running the script using any of the clustering algorithms I get this error:

        ./build-reuters.sh: line 165: 17319 Killed                  $MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir/ -o ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans 

        2. Regarding MinHash is the clusterdump part required? If yes, can someone tell me what needs to be done to implement it for MinHash? I'm not too sure how to implement it in case it is needed.
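
The `Killed` message in item 1 is what bash prints when a child process is terminated by SIGKILL; the thread does not establish why that happened here (one common cause is the kernel's out-of-memory killer, but that is only a possibility). A minimal sketch for confirming the signal from the exit status, where `run_and_report` is a hypothetical helper and not part of build-reuters.sh:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of build-reuters.sh): run a command and
# report when it was terminated by SIGKILL.
run_and_report() {
  "$@"
  local status=$?
  if [ "$status" -eq 137 ]; then
    # 137 = 128 + 9, the status the shell reports for a child killed by SIGKILL.
    echo "command was killed by SIGKILL (exit status 137): $*" >&2
  fi
  # Pass the original exit status through unchanged.
  return "$status"
}
```

In the script this could wrap the failing step, for example `run_and_report $MAHOUT seq2sparse ...` with the original arguments left unchanged.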

        gsingers Grant Ingersoll added a comment -

        1. Is it just me or when I try running the script using any of the clustering algorithms I get this error:

        Works for me

        2. Regarding MinHash is the clusterdump part required?

        Not necessarily, but the whole point is to somehow show some output. I guess we'd have to modify clusterdump to handle this format, since it is different from existing ones.

        gsingers Grant Ingersoll added a comment -

        I've committed this, but will leave the issue open for now, pending ClusterDump updates if Varun wishes to tackle it.

        varunthacker Varun Thacker added a comment -

        I will look into ClusterDump and keep everyone posted on the issue.

        hudson Hudson added a comment -

        Integrated in Mahout-Quality #1133 (See https://builds.apache.org/job/Mahout-Quality/1133/)
        MAHOUT-854: add minhash example

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1196448
        Files :

        • /mahout/trunk/examples/bin/build-reuters.sh
        gsingers Grant Ingersoll added a comment -

        It's hooked in.

        jeastman Jeff Eastman added a comment -

        Reopening since this appears to be related to a current Jenkins build failure:

        Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/mahout-work-jenkins/reuters-minhash already exists
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:134)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:846)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)

        gsingers Grant Ingersoll added a comment -

        I added in an overwrite option. Let's see if Jenkins is happy with that.
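
The stack trace above shows Hadoop's `FileOutputFormat.checkOutputSpecs` rejecting the job because the output directory was left over from an earlier run. The committed fix adds an overwrite option to MinHashDriver; the thread does not name the flag, so it is not reproduced here. As an illustration of the same idea on the script side, here is a minimal sketch that clears a stale output directory before the job starts. It assumes the directory is on the local filesystem (as the `/tmp/...` path in the trace suggests), and `clean_output_dir` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Mahout scripts): remove a stale
# local output directory so the next job run does not find it in place.
clean_output_dir() {
  local dir=$1
  # Guard against an empty argument or the filesystem root.
  if [ -z "$dir" ] || [ "$dir" = "/" ]; then
    echo "refusing to remove '$dir'" >&2
    return 1
  fi
  if [ -e "$dir" ]; then
    echo "removing existing output directory $dir" >&2
    rm -rf -- "$dir"
  fi
}
```

A call such as `clean_output_dir "${WORK_DIR}/reuters-minhash"` would go immediately before the MinHash step. If the work directory lived on HDFS, the equivalent Hadoop filesystem command would be needed instead of `rm`.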

        hudson Hudson added a comment -

        Integrated in Mahout-Quality #1302 (See https://builds.apache.org/job/Mahout-Quality/1302/)
        MAHOUT-854: add in overwrite option for Minhash

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1230780
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/minhash/MinHashDriver.java
        • /mahout/trunk/examples/bin/cluster-reuters.sh
        jeastman Jeff Eastman added a comment -

        Can this issue be closed?


          People

          • Assignee: gsingers Grant Ingersoll
          • Reporter: varunthacker Varun Thacker