[MAHOUT-588] Benchmark Mahout's clustering performance on EC2 and publish the results - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.5
Fix Version/s: 0.5
Component/s: None
Labels: None

Description

For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two people doing the project to do all the work in the open here. The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming they are big enough), run on EC2, and make all resources available so others can reproduce or improve the results.

I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer). The results will be published on the Wiki as well as in the book. This issue is to track the patches, etc.

Attachments

Name | Date | Size | Uploaded by
60_clusters_kmeans_10_iterations_100K_coordinates.txt | 08/Feb/11 18:16 | 7 kB | Szymon Chojnacki
clusters_kMeans.txt | 01/Feb/11 13:15 | 11 kB | Szymon Chojnacki
clusters1.txt | 03/Feb/11 14:49 | 203 kB | Szymon Chojnacki
distcp_large_to_s3_failed.log | 30/Jan/11 16:12 | 47 kB | Timothy Potter
ec2_setup_notes_v2.txt | 25/Feb/11 22:56 | 6 kB | Timothy Potter
ec2_setup_notes_v2.txt | 25/Feb/11 22:53 | 6 kB | Timothy Potter
ec2_setup_notes.txt | 06/Feb/11 19:29 | 6 kB | Timothy Potter
mahout-588_canopy.pdf | 28/Feb/11 19:38 | 161 kB | Szymon Chojnacki
mahout-588_distribution.pdf | 24/Feb/11 14:11 | 311 kB | Szymon Chojnacki
MAHOUT-588.patch | 24/Mar/11 04:33 | 35 kB | Timothy Potter
MailArchivesClusteringAnalyzer.java | 06/Mar/11 17:00 | 8 kB | Timothy Potter
MailArchivesClusteringAnalyzerTest.java | 06/Mar/11 17:00 | 2 kB | Timothy Potter
prep_asf_mail_archives.sh | 30/Mar/11 21:53 | 4 kB | Timothy Potter
prep_asf_mail_archives.sh | 25/Feb/11 22:56 | 3 kB | Timothy Potter
prep_asf_mail_archives.sh | 25/Feb/11 22:53 | 3 kB | Timothy Potter
seq2sparse_small_failed.log | 30/Jan/11 16:12 | 118 kB | Timothy Potter
seq2sparse_xlarge_ok.log | 30/Jan/11 16:12 | 230 kB | Timothy Potter
SequenceFilesFromMailArchives.java | 06/Mar/11 17:00 | 12 kB | Timothy Potter
SequenceFilesFromMailArchives.java | 28/Jan/11 03:15 | 12 kB | Timothy Potter
SequenceFilesFromMailArchives2.java | 28/Jan/11 15:04 | 10 kB | Szymon Chojnacki
SequenceFilesFromMailArchivesTest.java | 06/Mar/11 17:00 | 7 kB | Timothy Potter
TamingAnalyzer.java | 06/Feb/11 19:29 | 2 kB | Timothy Potter
TamingAnalyzer.java | 03/Feb/11 13:40 | 3 kB | Szymon Chojnacki
TamingAnalyzerTest.java | 06/Feb/11 19:29 | 1 kB | Timothy Potter
TamingCollocDriver.java | 03/Feb/11 13:47 | 10 kB | Szymon Chojnacki
TamingCollocMapper.java | 03/Feb/11 13:57 | 7 kB | Szymon Chojnacki
TamingDictionaryVectorizer.java | 03/Feb/11 13:47 | 14 kB | Szymon Chojnacki
TamingDictVect.java | 03/Feb/11 13:47 | 1 kB | Szymon Chojnacki
TamingGramKeyGroupComparator.java | 03/Feb/11 13:47 | 0.7 kB | Szymon Chojnacki
TamingSubset.java | 22/Feb/11 11:13 | 2 kB | Szymon Chojnacki
TamingSubsetMapper.java | 22/Feb/11 11:13 | 0.9 kB | Szymon Chojnacki
TamingTFIDF.java | 03/Feb/11 13:59 | 0.9 kB | Szymon Chojnacki
TamingTokenizer.java | 03/Feb/11 13:40 | 0.8 kB | Szymon Chojnacki
Top1000Tokens_maybe_stopWords | 01/Feb/11 13:08 | 14 kB | Szymon Chojnacki
Uncompress.java | 28/Jan/11 13:23 | 4 kB | Szymon Chojnacki

Issue Links

is blocked by

MAHOUT-598 Downstream steps in the seq2sparse job flow looking in wrong location for output from previous steps when running in Elastic MapReduce (EMR) cluster (Closed)

is related to

MAHOUT-500 Make it easy to run Mahout on Amazon's Elastic Map Reduce (Closed)

MAHOUT-670 Provide a performance measurement framework for Mahout (Closed)

People

Assignee: Grant Ingersoll

Reporter: Grant Ingersoll

Votes: 0

Watchers: 4

Dates

Created: 22/Jan/11 12:32

Updated: 05/Jan/12 23:45

Resolved: 22/May/11 16:02