MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

      Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two people doing the project to do all of the work in the open here. The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming they are big enough), run on EC2, and make all resources available so that others can reproduce and improve on the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue tracks the patches, etc.

      1. Uncompress.java
        4 kB
        Szymon Chojnacki
      2. Top1000Tokens_maybe_stopWords
        14 kB
        Szymon Chojnacki
      3. TamingTokenizer.java
        0.8 kB
        Szymon Chojnacki
      4. TamingTFIDF.java
        0.9 kB
        Szymon Chojnacki
      5. TamingSubsetMapper.java
        0.9 kB
        Szymon Chojnacki
      6. TamingSubset.java
        2 kB
        Szymon Chojnacki
      7. TamingGramKeyGroupComparator.java
        0.7 kB
        Szymon Chojnacki
      8. TamingDictVect.java
        1 kB
        Szymon Chojnacki
      9. TamingDictionaryVectorizer.java
        14 kB
        Szymon Chojnacki
      10. TamingCollocMapper.java
        7 kB
        Szymon Chojnacki
      11. TamingCollocDriver.java
        10 kB
        Szymon Chojnacki
      12. TamingAnalyzerTest.java
        1 kB
        Timothy Potter
      13. TamingAnalyzer.java
        3 kB
        Szymon Chojnacki
      14. TamingAnalyzer.java
        2 kB
        Timothy Potter
      15. SequenceFilesFromMailArchivesTest.java
        7 kB
        Timothy Potter
      16. SequenceFilesFromMailArchives2.java
        10 kB
        Szymon Chojnacki
      17. SequenceFilesFromMailArchives.java
        12 kB
        Timothy Potter
      18. SequenceFilesFromMailArchives.java
        12 kB
        Timothy Potter
      19. seq2sparse_xlarge_ok.log
        230 kB
        Timothy Potter
      20. seq2sparse_small_failed.log
        118 kB
        Timothy Potter
      21. prep_asf_mail_archives.sh
        3 kB
        Timothy Potter
      22. prep_asf_mail_archives.sh
        3 kB
        Timothy Potter
      23. prep_asf_mail_archives.sh
        4 kB
        Timothy Potter
      24. MailArchivesClusteringAnalyzerTest.java
        2 kB
        Timothy Potter
      25. MailArchivesClusteringAnalyzer.java
        8 kB
        Timothy Potter
      26. MAHOUT-588.patch
        35 kB
        Timothy Potter
      27. mahout-588_distribution.pdf
        311 kB
        Szymon Chojnacki
      28. mahout-588_canopy.pdf
        161 kB
        Szymon Chojnacki
      29. ec2_setup_notes.txt
        6 kB
        Timothy Potter
      30. ec2_setup_notes_v2.txt
        6 kB
        Timothy Potter
      31. ec2_setup_notes_v2.txt
        6 kB
        Timothy Potter
      32. distcp_large_to_s3_failed.log
        47 kB
        Timothy Potter
      33. clusters1.txt
        203 kB
        Szymon Chojnacki
      34. clusters_kMeans.txt
        11 kB
        Szymon Chojnacki
      35. 60_clusters_kmeans_10_iterations_100K_coordinates.txt
        7 kB
        Szymon Chojnacki

        Issue Links

          Activity

          Grant Ingersoll created issue -
          Grant Ingersoll made changes -
          Field Original Value New Value
          Link This issue is blocked by MAHOUT-598 [ MAHOUT-598 ]
          Timothy Potter made changes -
          Attachment SequenceFilesFromMailArchives.java [ 12469624 ]
          Szymon Chojnacki made changes -
          Attachment Uncompress.java [ 12469674 ]
          Szymon Chojnacki made changes -
          Attachment SequenceFilesFromMailArchives2.java [ 12469676 ]
          Timothy Potter made changes -
          Attachment seq2sparse_xlarge_ok.log [ 12469774 ]
          Attachment seq2sparse_small_failed.log [ 12469775 ]
          Attachment distcp_large_to_s3_failed.log [ 12469776 ]
          Szymon Chojnacki made changes -
          Attachment Top1000Tokens_maybe_stopWords [ 12469923 ]
          Szymon Chojnacki made changes -
          Attachment clusters_kMeans.txt [ 12469924 ]
          Szymon Chojnacki made changes -
          Attachment TamingAnalyzer.java [ 12469968 ]
          Szymon Chojnacki made changes -
          Attachment TamingTokenizer.java [ 12470134 ]
          Attachment TamingAnalyzer.java [ 12470135 ]
          Szymon Chojnacki made changes -
          Attachment TamingAnalyzer.java [ 12469968 ]
          Szymon Chojnacki made changes -
          Attachment TamingDictVect.java [ 12470136 ]
          Attachment TamingCollocDriver.java [ 12470137 ]
          Attachment TamingCollocMapper.java [ 12470138 ]
          Attachment TamingGramKeyGroupComparator.java [ 12470139 ]
          Attachment TamingDictionaryVectorizer.java [ 12470140 ]
          Szymon Chojnacki made changes -
          Attachment TamingCollocMapper.java [ 12470138 ]
          Szymon Chojnacki made changes -
          Attachment TamingCollocMapper.java [ 12470141 ]
          Szymon Chojnacki made changes -
          Attachment TamingTFIDF.java [ 12470142 ]
          Szymon Chojnacki made changes -
          Attachment clusters1.txt [ 12470145 ]
          Timothy Potter made changes -
          Attachment ec2_setup_notes.txt [ 12470406 ]
          Attachment TamingAnalyzerTest.java [ 12470407 ]
          Attachment TamingAnalyzer.java [ 12470408 ]
          Szymon Chojnacki made changes -
          Szymon Chojnacki made changes -
          Attachment TamingSubset.java [ 12471605 ]
          Attachment TamingSubsetMapper.java [ 12471606 ]
          Szymon Chojnacki made changes -
          Attachment mahout-588_distribution.pdf [ 12471836 ]
          Timothy Potter made changes -
          Attachment prep_asf_mail_archives.sh [ 12471986 ]
          Attachment ec2_setup_notes_v2.txt [ 12471987 ]
          Timothy Potter made changes -
          Attachment prep_asf_mail_archives.sh [ 12471988 ]
          Attachment ec2_setup_notes_v2.txt [ 12471989 ]
          Szymon Chojnacki made changes -
          Attachment mahout-588_canopy.pdf [ 12472217 ]
          Timothy Potter made changes -
          Attachment SequenceFilesFromMailArchives.java [ 12472787 ]
          Attachment MailArchivesClusteringAnalyzerTest.java [ 12472788 ]
          Attachment SequenceFilesFromMailArchivesTest.java [ 12472789 ]
          Attachment MailArchivesClusteringAnalyzer.java [ 12472786 ]
          Grant Ingersoll made changes -
          Link This issue is related to MAHOUT-500 [ MAHOUT-500 ]
          Timothy Potter made changes -
          Attachment MAHOUT-588.patch [ 12474465 ]
          Timothy Potter made changes -
          Attachment prep_asf_mail_archives.sh [ 12475020 ]
          Sean Owen made changes -
          Assignee Grant Ingersoll [ gsingers ]
          Fix Version/s 0.6 [ 12316364 ]
          Affects Version/s 0.5 [ 12315255 ]
          Grant Ingersoll made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 0.5 [ 12315255 ]
          Fix Version/s 0.6 [ 12316364 ]
          Resolution Fixed [ 1 ]
          Sean Owen made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Oliver B. Fischer made changes -
          Link This issue is related to MAHOUT-670 [ MAHOUT-670 ]

            People

            • Assignee: Grant Ingersoll
            • Reporter: Grant Ingersoll
            • Votes: 0
            • Watchers: 4
