Mahout / MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

      Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two people doing the project to do all the work in the open here. The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming they are big enough), run on EC2, and make all resources available so others can reproduce or improve the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue is to track the patches, etc.
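
      As a rough illustration of what such setup code has to do, the sketch below splits mbox-style mail archive text into messages and emits (key, text) records of the kind that could later be written to sequence files and vectorized. It is not the code attached to this issue: the function names `split_mbox` and `to_records`, the choice of Message-ID as key, and the subject-plus-body text are all assumptions made for the example, and it uses Python's standard `email` package rather than Mahout or Hadoop.

```python
import email
from email import policy


def split_mbox(text):
    """Split raw mbox text into individual message strings.

    Simplification (assumption): every line starting with "From " is
    treated as a message separator. Real archives may contain unescaped
    "From " lines inside bodies, which this sketch would split wrongly.
    """
    messages, current = [], None
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            if current:
                messages.append("".join(current))
            current = []  # start a new message; drop the separator line
        elif current is not None:
            current.append(line)
    if current:
        messages.append("".join(current))
    return messages


def to_records(text):
    """Return a list of (message_id, subject + body) pairs (hypothetical layout)."""
    records = []
    for raw in split_mbox(text):
        msg = email.message_from_string(raw, policy=policy.default)
        key = str(msg.get("Message-ID", "")).strip()
        subject = str(msg.get("Subject", ""))
        body = msg.get_body(preferencelist=("plain",))
        content = body.get_content() if body is not None else ""
        records.append((key, (subject + "\n" + content).strip()))
    return records


if __name__ == "__main__":
    sample = (
        "From a@example.org Mon Jan  3 10:00:00 2011\n"
        "Message-ID: <1@example.org>\n"
        "Subject: kmeans question\n"
        "\n"
        "How many iterations?\n"
    )
    for key, text in to_records(sample):
        print(key, "->", text.replace("\n", " | "))
```

      For archives on disk, the standard library's `mailbox.mbox` class would likely be a better starting point than the hand-rolled splitter, which is kept here only so the example runs on an in-memory string.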

      Attachments

      1. 60_clusters_kmeans_10_iterations_100K_coordinates.txt
        7 kB
        Szymon Chojnacki
      2. clusters_kMeans.txt
        11 kB
        Szymon Chojnacki
      3. clusters1.txt
        203 kB
        Szymon Chojnacki
      4. distcp_large_to_s3_failed.log
        47 kB
        Timothy Potter
      5. ec2_setup_notes_v2.txt
        6 kB
        Timothy Potter
      6. ec2_setup_notes_v2.txt
        6 kB
        Timothy Potter
      7. ec2_setup_notes.txt
        6 kB
        Timothy Potter
      8. mahout-588_canopy.pdf
        161 kB
        Szymon Chojnacki
      9. mahout-588_distribution.pdf
        311 kB
        Szymon Chojnacki
      10. MAHOUT-588.patch
        35 kB
        Timothy Potter
      11. MailArchivesClusteringAnalyzer.java
        8 kB
        Timothy Potter
      12. MailArchivesClusteringAnalyzerTest.java
        2 kB
        Timothy Potter
      13. prep_asf_mail_archives.sh
        4 kB
        Timothy Potter
      14. prep_asf_mail_archives.sh
        3 kB
        Timothy Potter
      15. prep_asf_mail_archives.sh
        3 kB
        Timothy Potter
      16. seq2sparse_small_failed.log
        118 kB
        Timothy Potter
      17. seq2sparse_xlarge_ok.log
        230 kB
        Timothy Potter
      18. SequenceFilesFromMailArchives.java
        12 kB
        Timothy Potter
      19. SequenceFilesFromMailArchives.java
        12 kB
        Timothy Potter
      20. SequenceFilesFromMailArchives2.java
        10 kB
        Szymon Chojnacki
      21. SequenceFilesFromMailArchivesTest.java
        7 kB
        Timothy Potter
      22. TamingAnalyzer.java
        2 kB
        Timothy Potter
      23. TamingAnalyzer.java
        3 kB
        Szymon Chojnacki
      24. TamingAnalyzerTest.java
        1 kB
        Timothy Potter
      25. TamingCollocDriver.java
        10 kB
        Szymon Chojnacki
      26. TamingCollocMapper.java
        7 kB
        Szymon Chojnacki
      27. TamingDictionaryVectorizer.java
        14 kB
        Szymon Chojnacki
      28. TamingDictVect.java
        1 kB
        Szymon Chojnacki
      29. TamingGramKeyGroupComparator.java
        0.7 kB
        Szymon Chojnacki
      30. TamingSubset.java
        2 kB
        Szymon Chojnacki
      31. TamingSubsetMapper.java
        0.9 kB
        Szymon Chojnacki
      32. TamingTFIDF.java
        0.9 kB
        Szymon Chojnacki
      33. TamingTokenizer.java
        0.8 kB
        Szymon Chojnacki
      34. Top1000Tokens_maybe_stopWords
        14 kB
        Szymon Chojnacki
      35. Uncompress.java
        4 kB
        Szymon Chojnacki

            People

            • Assignee: Grant Ingersoll
            • Reporter: Grant Ingersoll
            • Votes: 0
            • Watchers: 5
