Mahout / MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

    Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms, and I've asked the two people doing the project to do all of the work in the open here. The goal is to use a publicly reusable dataset (for now the ASF mail archives, assuming they are big enough), run the benchmarks on EC2, and make all resources available so that others can reproduce and improve on the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue tracks the patches and related attachments.
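
      To illustrate the kind of setup step meant here, the following is a minimal Python sketch that turns mbox-style mail archive text into (key, value) records that could later be vectorized. It is not the attached SequenceFilesFromMailArchives.java; the function names, the sample data, and the choice of Message-ID as key and subject plus body as value are all assumptions made for illustration.

```python
import email


def split_mbox(text):
    """Split mbox-formatted text into raw messages on 'From ' separator lines."""
    messages, current = [], None
    for line in text.splitlines():
        if line.startswith("From "):
            # A new separator closes the previous message, if there was one.
            if current is not None:
                messages.append("\n".join(current))
            current = []  # the separator line itself is dropped
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("\n".join(current))
    return messages


def to_records(text):
    """Yield (key, value) pairs: Message-ID as key, subject plus body as value.

    Assumption: multipart messages are emitted with an empty body in this sketch.
    """
    for raw in split_mbox(text):
        msg = email.message_from_string(raw)
        key = (msg["Message-ID"] or "").strip()
        subject = msg["Subject"] or ""
        body = "" if msg.is_multipart() else msg.get_payload()
        yield key, f"{subject}\n{body}".strip()


# Hypothetical sample data standing in for one mail archive file.
SAMPLE = (
    "From dev@example.org Mon Jan 10 00:00:00 2011\n"
    "Message-ID: <1@example.org>\n"
    "Subject: kmeans on EC2\n"
    "\n"
    "First body line.\n"
    "From user@example.org Mon Jan 10 00:05:00 2011\n"
    "Message-ID: <2@example.org>\n"
    "Subject: Re: kmeans on EC2\n"
    "\n"
    "Reply body.\n"
)

records = list(to_records(SAMPLE))
for key, value in records:
    print(key, repr(value))
```

      The sketch relies on the mbox convention that body lines beginning with "From " are escaped (typically as ">From "), so an unescaped archive would be split incorrectly. Writing the records to Hadoop SequenceFiles and running them through a vectorizer is outside its scope.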

      Attachments

        1. prep_asf_mail_archives.sh (4 kB, Timothy Potter)
        2. MAHOUT-588.patch (35 kB, Timothy Potter)
        3. SequenceFilesFromMailArchivesTest.java (7 kB, Timothy Potter)
        4. MailArchivesClusteringAnalyzerTest.java (2 kB, Timothy Potter)
        5. SequenceFilesFromMailArchives.java (12 kB, Timothy Potter)
        6. MailArchivesClusteringAnalyzer.java (8 kB, Timothy Potter)
        7. mahout-588_canopy.pdf (161 kB, Szymon Chojnacki)
        8. ec2_setup_notes_v2.txt (6 kB, Timothy Potter)
        9. prep_asf_mail_archives.sh (3 kB, Timothy Potter)
        10. ec2_setup_notes_v2.txt (6 kB, Timothy Potter)
        11. prep_asf_mail_archives.sh (3 kB, Timothy Potter)
        12. mahout-588_distribution.pdf (311 kB, Szymon Chojnacki)
        13. TamingSubsetMapper.java (0.9 kB, Szymon Chojnacki)
        14. TamingSubset.java (2 kB, Szymon Chojnacki)
        15. 60_clusters_kmeans_10_iterations_100K_coordinates.txt (7 kB, Szymon Chojnacki)
        16. TamingAnalyzer.java (2 kB, Timothy Potter)
        17. TamingAnalyzerTest.java (1 kB, Timothy Potter)
        18. ec2_setup_notes.txt (6 kB, Timothy Potter)
        19. clusters1.txt (203 kB, Szymon Chojnacki)
        20. TamingTFIDF.java (0.9 kB, Szymon Chojnacki)
        21. TamingCollocMapper.java (7 kB, Szymon Chojnacki)
        22. TamingDictionaryVectorizer.java (14 kB, Szymon Chojnacki)
        23. TamingGramKeyGroupComparator.java (0.7 kB, Szymon Chojnacki)
        24. TamingCollocDriver.java (10 kB, Szymon Chojnacki)
        25. TamingDictVect.java (1 kB, Szymon Chojnacki)
        26. TamingAnalyzer.java (3 kB, Szymon Chojnacki)
        27. TamingTokenizer.java (0.8 kB, Szymon Chojnacki)
        28. clusters_kMeans.txt (11 kB, Szymon Chojnacki)
        29. Top1000Tokens_maybe_stopWords (14 kB, Szymon Chojnacki)
        30. distcp_large_to_s3_failed.log (47 kB, Timothy Potter)
        31. seq2sparse_small_failed.log (118 kB, Timothy Potter)
        32. seq2sparse_xlarge_ok.log (230 kB, Timothy Potter)
        33. SequenceFilesFromMailArchives2.java (10 kB, Szymon Chojnacki)
        34. Uncompress.java (4 kB, Szymon Chojnacki)
        35. SequenceFilesFromMailArchives.java (12 kB, Timothy Potter)

            People

              Assignee: Grant Ingersoll (gsingers)
              Reporter: Grant Ingersoll (gsingers)
              Votes: 0
              Watchers: 4
