  1. Mahout
  2. MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

    Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two people doing the project to do all the work in the open here. The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming they are big enough), run the benchmarks on EC2, and make all resources available so others can reproduce and improve on the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue is to track the patches and related work.

      Attachments

        1. 60_clusters_kmeans_10_iterations_100K_coordinates.txt
          7 kB
          Szymon Chojnacki
        2. clusters_kMeans.txt
          11 kB
          Szymon Chojnacki
        3. clusters1.txt
          203 kB
          Szymon Chojnacki
        4. distcp_large_to_s3_failed.log
          47 kB
          Timothy Potter
        5. ec2_setup_notes_v2.txt
          6 kB
          Timothy Potter
        6. ec2_setup_notes_v2.txt
          6 kB
          Timothy Potter
        7. ec2_setup_notes.txt
          6 kB
          Timothy Potter
        8. mahout-588_canopy.pdf
          161 kB
          Szymon Chojnacki
        9. mahout-588_distribution.pdf
          311 kB
          Szymon Chojnacki
        10. MAHOUT-588.patch
          35 kB
          Timothy Potter
        11. MailArchivesClusteringAnalyzer.java
          8 kB
          Timothy Potter
        12. MailArchivesClusteringAnalyzerTest.java
          2 kB
          Timothy Potter
        13. prep_asf_mail_archives.sh
          4 kB
          Timothy Potter
        14. prep_asf_mail_archives.sh
          3 kB
          Timothy Potter
        15. prep_asf_mail_archives.sh
          3 kB
          Timothy Potter
        16. seq2sparse_small_failed.log
          118 kB
          Timothy Potter
        17. seq2sparse_xlarge_ok.log
          230 kB
          Timothy Potter
        18. SequenceFilesFromMailArchives.java
          12 kB
          Timothy Potter
        19. SequenceFilesFromMailArchives.java
          12 kB
          Timothy Potter
        20. SequenceFilesFromMailArchives2.java
          10 kB
          Szymon Chojnacki
        21. SequenceFilesFromMailArchivesTest.java
          7 kB
          Timothy Potter
        22. TamingAnalyzer.java
          2 kB
          Timothy Potter
        23. TamingAnalyzer.java
          3 kB
          Szymon Chojnacki
        24. TamingAnalyzerTest.java
          1 kB
          Timothy Potter
        25. TamingCollocDriver.java
          10 kB
          Szymon Chojnacki
        26. TamingCollocMapper.java
          7 kB
          Szymon Chojnacki
        27. TamingDictionaryVectorizer.java
          14 kB
          Szymon Chojnacki
        28. TamingDictVect.java
          1 kB
          Szymon Chojnacki
        29. TamingGramKeyGroupComparator.java
          0.7 kB
          Szymon Chojnacki
        30. TamingSubset.java
          2 kB
          Szymon Chojnacki
        31. TamingSubsetMapper.java
          0.9 kB
          Szymon Chojnacki
        32. TamingTFIDF.java
          0.9 kB
          Szymon Chojnacki
        33. TamingTokenizer.java
          0.8 kB
          Szymon Chojnacki
        34. Top1000Tokens_maybe_stopWords
          14 kB
          Szymon Chojnacki
        35. Uncompress.java
          4 kB
          Szymon Chojnacki

            People

              Assignee: Grant Ingersoll (gsingers)
              Reporter: Grant Ingersoll (gsingers)
              Votes: 0
              Watchers: 4
