Mahout / MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

      Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms, and I've asked the two people doing the project to do all of the work in the open here. The goal is to use a publicly reusable dataset (for now the ASF mail archives, assuming they are big enough), run on EC2, and make all resources available so that others can reproduce or improve on the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue is to track the patches, etc.

      1. prep_asf_mail_archives.sh (4 kB, Timothy Potter)
      2. MAHOUT-588.patch (35 kB, Timothy Potter)
      3. SequenceFilesFromMailArchivesTest.java (7 kB, Timothy Potter)
      4. MailArchivesClusteringAnalyzerTest.java (2 kB, Timothy Potter)
      5. SequenceFilesFromMailArchives.java (12 kB, Timothy Potter)
      6. MailArchivesClusteringAnalyzer.java (8 kB, Timothy Potter)
      7. mahout-588_canopy.pdf (161 kB, Szymon Chojnacki)
      8. ec2_setup_notes_v2.txt (6 kB, Timothy Potter)
      9. prep_asf_mail_archives.sh (3 kB, Timothy Potter)
      10. ec2_setup_notes_v2.txt (6 kB, Timothy Potter)
      11. prep_asf_mail_archives.sh (3 kB, Timothy Potter)
      12. mahout-588_distribution.pdf (311 kB, Szymon Chojnacki)
      13. TamingSubsetMapper.java (0.9 kB, Szymon Chojnacki)
      14. TamingSubset.java (2 kB, Szymon Chojnacki)
      15. 60_clusters_kmeans_10_iterations_100K_coordinates.txt (7 kB, Szymon Chojnacki)
      16. TamingAnalyzer.java (2 kB, Timothy Potter)
      17. TamingAnalyzerTest.java (1 kB, Timothy Potter)
      18. ec2_setup_notes.txt (6 kB, Timothy Potter)
      19. clusters1.txt (203 kB, Szymon Chojnacki)
      20. TamingTFIDF.java (0.9 kB, Szymon Chojnacki)
      21. TamingCollocMapper.java (7 kB, Szymon Chojnacki)
      22. TamingDictionaryVectorizer.java (14 kB, Szymon Chojnacki)
      23. TamingGramKeyGroupComparator.java (0.7 kB, Szymon Chojnacki)
      24. TamingCollocDriver.java (10 kB, Szymon Chojnacki)
      25. TamingDictVect.java (1 kB, Szymon Chojnacki)
      26. TamingAnalyzer.java (3 kB, Szymon Chojnacki)
      27. TamingTokenizer.java (0.8 kB, Szymon Chojnacki)
      28. clusters_kMeans.txt (11 kB, Szymon Chojnacki)
      29. Top1000Tokens_maybe_stopWords (14 kB, Szymon Chojnacki)
      30. distcp_large_to_s3_failed.log (47 kB, Timothy Potter)
      31. seq2sparse_small_failed.log (118 kB, Timothy Potter)
      32. seq2sparse_xlarge_ok.log (230 kB, Timothy Potter)
      33. SequenceFilesFromMailArchives2.java (10 kB, Szymon Chojnacki)
      34. Uncompress.java (4 kB, Szymon Chojnacki)
      35. SequenceFilesFromMailArchives.java (12 kB, Timothy Potter)

            People

            • Assignee: Grant Ingersoll
            • Reporter: Grant Ingersoll
            • Votes: 0
            • Watchers: 4
