[MAHOUT-468] Performance of RowSimilarityJob is not good - ASF JIRA

Details

Type: Test
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 0.4
Fix Version/s: None
Component/s: None
Labels: None

Description

I ran a test with the following data:

Preference records: 680,194
Distinct users: 23,246
Distinct items: 437,569
SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE

maybePruneItemUserMatrixPath: 16.50M
weights: 13.80M
pairwiseSimilarity: 18.81G
Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 sec
Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours

I think the reasons may be the following:
1) We use SequenceFileOutputFormat, which causes the job to run with only n mappers or reducers concurrently (n = number of Hadoop nodes).
2) We store redundant information.

For example, the output of CooccurrencesMapper: (ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)

3) Some frequently used code, see
https://issues.apache.org/jira/browse/MAHOUT-467

4) Many local variables are allocated inside loops (needs confirmation).

In class DistributedUncenteredZeroAssumingCosineVectorSimilarity:

@Override
public double weight(Vector v) {
  double length = 0.0;
  Iterator<Element> elemIterator = v.iterateNonZero();
  while (elemIterator.hasNext()) {
    double value = elemIterator.next().get(); // this one
    length += value * value;
  }
  return Math.sqrt(length);
}
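
On point 4: in Java, a primitive local such as the double declared inside the loop above lives in the method's stack frame and does not cause a heap allocation per iteration, so moving the declaration out of the loop should not change the cost. The sketch below is not Mahout code. It assumes a plain double[] of non-zero values in place of Mahout's Vector so that it runs without Mahout on the classpath, and the class and method names are hypothetical. Both variants compute the same Euclidean norm.

```java
final class WeightSketch {

  // Variant 1: the local is declared inside the loop body, as in the snippet above.
  static double weightInner(double[] nonZeroValues) {
    double length = 0.0;
    for (int i = 0; i < nonZeroValues.length; i++) {
      double value = nonZeroValues[i]; // primitive local: a stack slot, no heap allocation
      length += value * value;
    }
    return Math.sqrt(length);
  }

  // Variant 2: the local is declared once, before the loop.
  static double weightOuter(double[] nonZeroValues) {
    double length = 0.0;
    double value;
    for (int i = 0; i < nonZeroValues.length; i++) {
      value = nonZeroValues[i];
      length += value * value;
    }
    return Math.sqrt(length);
  }

  public static void main(String[] args) {
    double[] values = {3.0, 4.0}; // made-up sample data
    System.out.println(weightInner(values)); // prints 5.0
    System.out.println(weightOuter(values)); // prints 5.0
  }
}
```

If per-iteration cost matters in the real method, the iterator and the Element objects it returns are more plausible candidates to profile than the local double.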

5) Maybe we need to control the size of the cooccurrences.
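
To illustrate why the size of the cooccurrences matters: assuming every unordered pair of entries in a column is emitted as a cooccurrence, a column with n entries yields n(n-1)/2 pairs, so a few very long columns can dominate the output. The sketch below is not Mahout code. The class name, the method names, and the cap maxPerColumn are hypothetical, and the column lengths are made-up numbers, not taken from the test above.

```java
final class CooccurrenceSizeSketch {

  // Number of unordered pairs among n entries: n * (n - 1) / 2.
  static long pairs(long n) {
    return n * (n - 1) / 2;
  }

  // Total pairs over all columns when each column is truncated to at most maxPerColumn entries.
  static long totalPairs(int[] columnLengths, int maxPerColumn) {
    long total = 0;
    for (int n : columnLengths) {
      total += pairs(Math.min(n, maxPerColumn));
    }
    return total;
  }

  public static void main(String[] args) {
    int[] columnLengths = {2, 3, 1000}; // made-up sample: two short columns, one long one
    System.out.println(totalPairs(columnLengths, Integer.MAX_VALUE)); // prints 499504
    System.out.println(totalPairs(columnLengths, 100));               // prints 4954
  }
}
```

In this made-up example the single long column accounts for almost all pairs, and capping it at 100 entries cuts the total by roughly a factor of 100.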

Attachments

RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg
13/Aug/10 03:32
105 kB
Han Hui Wen

Issue Links

duplicates

MAHOUT-460 Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Closed

People

Assignee: Unassigned

Reporter: Han Hui Wen

Votes: 0

Watchers: 1

Dates

Created: 12/Aug/10 14:07

Updated: 31/Jan/24 22:11

Resolved: 14/Aug/10 18:35