MAHOUT-854

Add MinHash to build-reuters.sh example

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: Clustering, Examples
    • Labels: None

      Description

      The Reuters data set can be used for MinHash clustering, so it would be nice to add the MinHash algorithm to build-reuters.sh.

      1. MAHOUT-854.patch
        1 kB
        Varun Thacker

        Activity

        Varun Thacker created issue -
        Varun Thacker made changes -
        Field Original Value New Value
        Component/s Clustering [ 12312151 ]
        Component/s Examples [ 12314318 ]
        Varun Thacker added a comment -

        I am not sure about two things:

        1. Is it just me, or does anyone else get this error when running the script with any of the clustering algorithms?

        ./build-reuters.sh: line 165: 17319 Killed                  $MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir/ -o ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans 

        2. Regarding MinHash, is the clusterdump part required? If so, can someone tell me what needs to be done to implement it for MinHash? I'm not too sure how to implement it if it is needed.
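
The "Killed" message above is what bash prints when a child process dies from SIGKILL; such a process ends with exit status 137 (128 + 9). A minimal sketch of a wrapper that makes this visible in the script output follows. `run_step` is a hypothetical helper, not part of build-reuters.sh, and the out-of-memory hint is only an assumption about one common cause.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in build-reuters.sh): run one step of the script
# and report when the step was terminated by SIGKILL.
run_step() {
  "$@"
  local status=$?
  if [ "$status" -eq 137 ]; then
    # 137 = 128 + 9 (SIGKILL). Running out of memory is one possible cause,
    # but that is an assumption, not something the error message proves.
    echo "step was killed by SIGKILL (exit 137); possibly out of memory" >&2
  fi
  return "$status"
}
```

It could wrap the failing step as `run_step $MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir/ -o ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans`.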

        Varun Thacker made changes -
        Attachment MAHOUT-854.patch [ 12501525 ]
        Grant Ingersoll made changes -
        Assignee Grant Ingersoll [ gsingers ]
        Grant Ingersoll added a comment -

        1. Is it just me or when I try running the script using any of the clustering algorithms I get this error:

        Works for me

        2. Regarding MinHash is the clusterdump part required?

        Not necessarily, but the whole point is to somehow show some output. I guess we'd have to modify clusterdump to handle this format, since it is different from existing ones.

        Grant Ingersoll added a comment -

        I've committed this, but will leave the issue open for now, pending ClusterDump updates if Varun wishes to tackle it.

        Varun Thacker added a comment -

        I will look into ClusterDump and keep everyone posted on the issue.

        Hudson added a comment -

        Integrated in Mahout-Quality #1133 (See https://builds.apache.org/job/Mahout-Quality/1133/)
        MAHOUT-854: add minhash example

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1196448
        Files :

        • /mahout/trunk/examples/bin/build-reuters.sh
        Grant Ingersoll added a comment -

        It's hooked in.

        Grant Ingersoll made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Jeff Eastman added a comment -

        Reopening since this appears to be related to a current Jenkins build failure:

        Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/mahout-work-jenkins/reuters-minhash already exists
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:134)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:846)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)

        Jeff Eastman made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Grant Ingersoll added a comment -

        I added in an overwrite option. Let's see if Jenkins is happy with that.
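
Hadoop refuses to start a job whose output directory already exists, which is what the FileAlreadyExistsException above reports. Besides an overwrite option in the driver, a script can also clear stale output before a rerun. A minimal sketch, assuming the work directory lives on the local filesystem; `clean_output_dir` is a hypothetical helper, not something from the committed script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the committed script): remove a stale local
# output directory so that a rerun does not trip over existing output.
clean_output_dir() {
  local dir="$1"
  # Guard against an unset variable expanding to an empty path or to "/".
  if [ -z "$dir" ] || [ "$dir" = "/" ]; then
    echo "refusing to remove '$dir'" >&2
    return 1
  fi
  # Only act when the directory is actually present.
  if [ -d "$dir" ]; then
    rm -rf -- "$dir"
  fi
}
```

For the failure above, the call would be something like `clean_output_dir "${WORK_DIR}/reuters-minhash"` just before the MinHash step.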

        Hudson added a comment -

        Integrated in Mahout-Quality #1302 (See https://builds.apache.org/job/Mahout-Quality/1302/)
        MAHOUT-854: add in overwrite option for Minhash

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1230780
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/minhash/MinHashDriver.java
        • /mahout/trunk/examples/bin/cluster-reuters.sh
        Jeff Eastman added a comment -

        Can this issue be closed?

        Grant Ingersoll made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Grant Ingersoll
          • Reporter: Varun Thacker