Mahout / MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

    Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two people doing the project to do all of the work in the open here. The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming they are big enough), run the benchmarks on EC2, and make all resources available so that others can reproduce or improve on the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue is to track the patches, etc.

      Attachments

        1. Uncompress.java
          4 kB
          Szymon Chojnacki
        2. Top1000Tokens_maybe_stopWords
          14 kB
          Szymon Chojnacki
        3. TamingTokenizer.java
          0.8 kB
          Szymon Chojnacki
        4. TamingTFIDF.java
          0.9 kB
          Szymon Chojnacki
        5. TamingSubsetMapper.java
          0.9 kB
          Szymon Chojnacki
        6. TamingSubset.java
          2 kB
          Szymon Chojnacki
        7. TamingGramKeyGroupComparator.java
          0.7 kB
          Szymon Chojnacki
        8. TamingDictVect.java
          1 kB
          Szymon Chojnacki
        9. TamingDictionaryVectorizer.java
          14 kB
          Szymon Chojnacki
        10. TamingCollocMapper.java
          7 kB
          Szymon Chojnacki
        11. TamingCollocDriver.java
          10 kB
          Szymon Chojnacki
        12. TamingAnalyzerTest.java
          1 kB
          Timothy Potter
        13. TamingAnalyzer.java
          3 kB
          Szymon Chojnacki
        14. TamingAnalyzer.java
          2 kB
          Timothy Potter
        15. SequenceFilesFromMailArchivesTest.java
          7 kB
          Timothy Potter
        16. SequenceFilesFromMailArchives2.java
          10 kB
          Szymon Chojnacki
        17. SequenceFilesFromMailArchives.java
          12 kB
          Timothy Potter
        18. SequenceFilesFromMailArchives.java
          12 kB
          Timothy Potter
        19. seq2sparse_xlarge_ok.log
          230 kB
          Timothy Potter
        20. seq2sparse_small_failed.log
          118 kB
          Timothy Potter
        21. prep_asf_mail_archives.sh
          3 kB
          Timothy Potter
        22. prep_asf_mail_archives.sh
          3 kB
          Timothy Potter
        23. prep_asf_mail_archives.sh
          4 kB
          Timothy Potter
        24. MailArchivesClusteringAnalyzerTest.java
          2 kB
          Timothy Potter
        25. MailArchivesClusteringAnalyzer.java
          8 kB
          Timothy Potter
        26. MAHOUT-588.patch
          35 kB
          Timothy Potter
        27. mahout-588_distribution.pdf
          311 kB
          Szymon Chojnacki
        28. mahout-588_canopy.pdf
          161 kB
          Szymon Chojnacki
        29. ec2_setup_notes.txt
          6 kB
          Timothy Potter
        30. ec2_setup_notes_v2.txt
          6 kB
          Timothy Potter
        31. ec2_setup_notes_v2.txt
          6 kB
          Timothy Potter
        32. distcp_large_to_s3_failed.log
          47 kB
          Timothy Potter
        33. clusters1.txt
          203 kB
          Szymon Chojnacki
        34. clusters_kMeans.txt
          11 kB
          Szymon Chojnacki
        35. 60_clusters_kmeans_10_iterations_100K_coordinates.txt
          7 kB
          Szymon Chojnacki

            People

              Assignee: Grant Ingersoll (gsingers)
              Reporter: Grant Ingersoll (gsingers)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: