[MAHOUT-418] Computing the pairwise similarities of the rows of a matrix - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.3
Fix Version/s: 0.4
Component/s: classic
Labels: None

Description

In response to the wish from ~~MAHOUT-362~~ and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item similarities for collaborative filtering.

The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner. It uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied; I've already implemented Tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
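
To make the idea concrete, here is a minimal in-memory sketch of the inverted-index approach as I understand it from the linked paper: index the matrix by column, let every pair of rows that share a column contribute a partial dot product, sum those per pair, and normalize at the end (cosine in this case). This is not the code from the patch. The class name PairwiseRowSimilaritySketch, the method cosineSimilarities and the "i,j" string keys are made up for illustration, and plain Java maps stand in for the SequenceFile<IntWritable,VectorWritable> input and the map/reduce phases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative single-machine sketch; hypothetical names, not the MAHOUT-418 patch. */
public class PairwiseRowSimilaritySketch {

    /** One non-zero cell as seen from its column: which row it is in, and its value. */
    static final class Posting {
        final int row;
        final double value;
        Posting(int row, double value) { this.row = row; this.value = value; }
    }

    /** Key for an unordered row pair, written as "smaller,larger". */
    static String pairKey(int a, int b) {
        return Math.min(a, b) + "," + Math.max(a, b);
    }

    /**
     * @param rows sparse matrix as rowId -> (columnId -> value)
     * @return cosine similarity for every pair of rows sharing at least one column
     */
    public static Map<String, Double> cosineSimilarities(Map<Integer, Map<Integer, Double>> rows) {
        // Step 1: compute each row's norm and build the column -> postings index.
        Map<Integer, Double> norms = new HashMap<>();
        Map<Integer, List<Posting>> columns = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            double sumOfSquares = 0.0;
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                double v = cell.getValue();
                sumOfSquares += v * v;
                columns.computeIfAbsent(cell.getKey(), k -> new ArrayList<>())
                       .add(new Posting(row.getKey(), v));
            }
            norms.put(row.getKey(), Math.sqrt(sumOfSquares));
        }

        // Step 2: within each column, every pair of rows adds a partial dot product.
        // Rows that never share a column are never paired at all.
        Map<String, Double> dots = new HashMap<>();
        for (List<Posting> postings : columns.values()) {
            for (int i = 0; i < postings.size(); i++) {
                for (int j = i + 1; j < postings.size(); j++) {
                    Posting a = postings.get(i);
                    Posting b = postings.get(j);
                    dots.merge(pairKey(a.row, b.row), a.value * b.value, Double::sum);
                }
            }
        }

        // Step 3: turn the summed dot products into cosine similarities.
        Map<String, Double> sims = new HashMap<>();
        for (Map.Entry<String, Double> e : dots.entrySet()) {
            String[] ids = e.getKey().split(",");
            double denom = norms.get(Integer.parseInt(ids[0])) * norms.get(Integer.parseInt(ids[1]));
            sims.put(e.getKey(), denom == 0.0 ? 0.0 : e.getValue() / denom);
        }
        return sims;
    }
}
```

Note that step 2 is quadratic in the number of rows per column, so a very dense column produces a lot of pairs.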

I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input. I'm also worried that it might buffer too much data in certain steps.

If you decide to include it in Mahout, some more effort and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would be needed, I guess.

Attachments


- MAHOUT-418.patch, 17/Jun/10 11:16, 44 kB, Sebastian Schelter
- MAHOUT-418-2.patch, 21/Jun/10 09:19, 44 kB, Sebastian Schelter
- MAHOUT-418-3.patch, 26/Jun/10 00:43, 217 kB, Sebastian Schelter

People

Assignee: Unassigned

Reporter: Sebastian Schelter

Votes: 0

Watchers: 0

Dates

Created: 17/Jun/10 11:14

Updated: 31/Jan/24 22:16

Resolved: 28/Jun/10 09:43