Mahout / MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

    Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two people doing the project to do all of the work in the open here. The goal is to use a publicly reusable dataset (for now the ASF mail archives, assuming they are big enough), run on EC2, and make all resources available so that others can reproduce and improve on the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue is for tracking the patches, etc.

      Attachments

        1. SequenceFilesFromMailArchives.java
          12 kB
          Timothy Potter
        2. Uncompress.java
          4 kB
          Szymon Chojnacki
        3. SequenceFilesFromMailArchives2.java
          10 kB
          Szymon Chojnacki
        4. seq2sparse_xlarge_ok.log
          230 kB
          Timothy Potter
        5. seq2sparse_small_failed.log
          118 kB
          Timothy Potter
        6. distcp_large_to_s3_failed.log
          47 kB
          Timothy Potter
        7. Top1000Tokens_maybe_stopWords
          14 kB
          Szymon Chojnacki
        8. clusters_kMeans.txt
          11 kB
          Szymon Chojnacki
        9. TamingTokenizer.java
          0.8 kB
          Szymon Chojnacki
        10. TamingAnalyzer.java
          3 kB
          Szymon Chojnacki
        11. TamingDictVect.java
          1 kB
          Szymon Chojnacki
        12. TamingCollocDriver.java
          10 kB
          Szymon Chojnacki
        13. TamingGramKeyGroupComparator.java
          0.7 kB
          Szymon Chojnacki
        14. TamingDictionaryVectorizer.java
          14 kB
          Szymon Chojnacki
        15. TamingCollocMapper.java
          7 kB
          Szymon Chojnacki
        16. TamingTFIDF.java
          0.9 kB
          Szymon Chojnacki
        17. clusters1.txt
          203 kB
          Szymon Chojnacki
        18. ec2_setup_notes.txt
          6 kB
          Timothy Potter
        19. TamingAnalyzerTest.java
          1 kB
          Timothy Potter
        20. TamingAnalyzer.java
          2 kB
          Timothy Potter
        21. 60_clusters_kmeans_10_iterations_100K_coordinates.txt
          7 kB
          Szymon Chojnacki
        22. TamingSubset.java
          2 kB
          Szymon Chojnacki
        23. TamingSubsetMapper.java
          0.9 kB
          Szymon Chojnacki
        24. mahout-588_distribution.pdf
          311 kB
          Szymon Chojnacki
        25. prep_asf_mail_archives.sh
          3 kB
          Timothy Potter
        26. ec2_setup_notes_v2.txt
          6 kB
          Timothy Potter
        27. prep_asf_mail_archives.sh
          3 kB
          Timothy Potter
        28. ec2_setup_notes_v2.txt
          6 kB
          Timothy Potter
        29. mahout-588_canopy.pdf
          161 kB
          Szymon Chojnacki
        30. MailArchivesClusteringAnalyzer.java
          8 kB
          Timothy Potter
        31. SequenceFilesFromMailArchives.java
          12 kB
          Timothy Potter
        32. MailArchivesClusteringAnalyzerTest.java
          2 kB
          Timothy Potter
        33. SequenceFilesFromMailArchivesTest.java
          7 kB
          Timothy Potter
        34. MAHOUT-588.patch
          35 kB
          Timothy Potter
        35. prep_asf_mail_archives.sh
          4 kB
          Timothy Potter

            People

              Assignee: Grant Ingersoll (gsingers)
              Reporter: Grant Ingersoll (gsingers)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: