[MAHOUT-466] simplify or alternative Similarity arithmetic(AbstractDistributedVectorSimilarity) for boolean data - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 0.4
Fix Version/s: 0.4
Component/s: None
Labels: None

Description

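For boolean data, the prefValue is always 1.0f, so the similarity arithmetic in the AbstractDistributedVectorSimilarity implementations can be simplified, or simpler alternatives provided.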

for example:
1) DistributedEuclideanDistanceVectorSimilarity

package org.apache.mahout.math.hadoop.similarity.vector;

import org.apache.mahout.math.hadoop.similarity.Cooccurrence;

/**
 * distributed implementation of euclidean distance as vector similarity measure
 */
public class DistributedEuclideanDistanceVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
      double weightOfVectorB, int numberOfColumns) {

    double n = 0.0;
    double sumXYdiff2 = 0.0;

    for (Cooccurrence cooccurrence : cooccurrences) {
      double diff = cooccurrence.getValueA() - cooccurrence.getValueB();
      sumXYdiff2 += diff * diff;
      n++;
    }

    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }

}

For boolean data this one always returns n (the number of cooccurrences): every diff is 0, so sumXYdiff2 is 0 and the denominator is 1.0.
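A small standalone sketch can be used to check this. It does not use any Mahout classes: `BooleanEuclideanCheck` and its `similarity` method are hypothetical names that only mirror the loop in `doComputeResult`, under the assumption that every cooccurring pair carries the value 1.0.

```java
public class BooleanEuclideanCheck {

  // Hypothetical helper (not Mahout API): same arithmetic as the loop in
  // doComputeResult, applied to the pairs (valuesA[i], valuesB[i]).
  static double similarity(double[] valuesA, double[] valuesB) {
    double n = 0.0;
    double sumXYdiff2 = 0.0;
    for (int i = 0; i < valuesA.length; i++) {
      double diff = valuesA[i] - valuesB[i];
      sumXYdiff2 += diff * diff;
      n++;
    }
    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }

  public static void main(String[] args) {
    // Boolean data: both sides of every cooccurrence are 1.0.
    double[] ones = {1.0, 1.0, 1.0};
    System.out.println(similarity(ones, ones)); // prints 3.0
  }
}
```

With three cooccurrences the result is 3.0, i.e. just the cooccurrence count, so the distance computation contributes nothing for boolean input.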
2) DistributedUncenteredCosineVectorSimilarity
package org.apache.mahout.math.hadoop.similarity.vector;

import org.apache.mahout.math.hadoop.similarity.Cooccurrence;

/**
 * distributed implementation of cosine similarity that does not center its data
 */
public class DistributedUncenteredCosineVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
      double weightOfVectorB, int numberOfColumns) {

    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;

    for (Cooccurrence cooccurrence : cooccurrences) {
      double x = cooccurrence.getValueA();
      double y = cooccurrence.getValueB();
      sumXY += x * y;
      sumX2 += x * x;
      sumY2 += y * y;
      n++;
    }

    if (n == 0) {
      return Double.NaN;
    }

    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      // One or both vectors has -all- the same values;
      // can't really say much similarity under this measure
      return Double.NaN;
    }

    return sumXY / denominator;
  }

}

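For boolean data this one always returns 1.0 (up to floating-point rounding): with n cooccurrences, sumXY, sumX2 and sumY2 all equal n, so the result is n / (sqrt(n) * sqrt(n)).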
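The same kind of standalone check works here. Again this is only a sketch: `BooleanCosineCheck` and `similarity` are hypothetical names, no Mahout classes are involved, and all values are assumed to be 1.0.

```java
public class BooleanCosineCheck {

  // Hypothetical helper (not Mahout API): same arithmetic as the uncentered
  // cosine loop in doComputeResult, applied to the pairs (valuesA[i], valuesB[i]).
  static double similarity(double[] valuesA, double[] valuesB) {
    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (int i = 0; i < valuesA.length; i++) {
      double x = valuesA[i];
      double y = valuesB[i];
      sumXY += x * y;
      sumX2 += x * x;
      sumY2 += y * y;
      n++;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      return Double.NaN;
    }
    return sumXY / denominator;
  }

  public static void main(String[] args) {
    // Four cooccurrences, all 1.0: 4 / (sqrt(4) * sqrt(4)).
    double[] ones = {1.0, 1.0, 1.0, 1.0};
    System.out.println(similarity(ones, ones)); // prints 1.0
  }
}
```

For counts that are not perfect squares the quotient can differ from 1.0 in the last bit because of the two square roots, so a comparison against 1.0 should use a small tolerance.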
3) DistributedUncenteredZeroAssumingCosineVectorSimilarity
If n users like ItemA, m users like ItemB, and p users like both ItemA and ItemB,

DistributedUncenteredZeroAssumingCosineVectorSimilarity returns p/(m*n).

It can also be used for boolean data, but we could provide a simpler variant that returns (p*p)/(m*n), which needs less computation.
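To make the proposal concrete, here is a rough sketch working from the three counts only. It assumes the textbook definition of zero-assuming cosine for 0/1 vectors, where the dot product is p and the vector lengths are sqrt(n) and sqrt(m), giving p / (sqrt(n) * sqrt(m)); this is an assumption about the measure, not a statement about what the Mahout class computes. Under that assumption the proposed (p*p)/(m*n) is the square of the cosine, so it ranks item pairs in the same order while avoiding the square roots. `BooleanCosineShortcut` and its methods are hypothetical names.

```java
public class BooleanCosineShortcut {

  // n = users who like ItemA, m = users who like ItemB, p = users who like both.
  // Textbook zero-assuming cosine for 0/1 vectors (assumed definition).
  static double cosine(long n, long m, long p) {
    if (n == 0 || m == 0) {
      return Double.NaN;
    }
    return p / (Math.sqrt(n) * Math.sqrt(m));
  }

  // Proposed simpler variant: (p*p)/(m*n), no square roots needed.
  static double squaredCosine(long n, long m, long p) {
    if (n == 0 || m == 0) {
      return Double.NaN;
    }
    return ((double) p * p) / ((double) n * m);
  }

  public static void main(String[] args) {
    System.out.println(cosine(4, 9, 3));        // prints 0.5
    System.out.println(squaredCosine(4, 9, 3)); // prints 0.25
  }
}
```

Since both values are non-negative, squaring does not change which item pairs score higher, only the scale of the scores.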

People

Assignee: Sean R. Owen

Reporter: Han Hui Wen

Votes: 0

Watchers: 1

Dates

Created: 12/Aug/10 12:41

Updated: 31/Jan/24 22:11

Resolved: 13/Aug/10 03:45