  1. Mahout
  2. MAHOUT-466

Simplify, or provide an alternative to, the similarity arithmetic (AbstractDistributedVectorSimilarity) for boolean data


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Component/s: None
    • Labels: None

    Description

      For boolean data, the prefValue is always 1.0f, so the similarity arithmetic can be simplified.

      For example:
      1) DistributedEuclideanDistanceVectorSimilarity

      package org.apache.mahout.math.hadoop.similarity.vector;

      import org.apache.mahout.math.hadoop.similarity.Cooccurrence;

      /**
       * distributed implementation of euclidean distance as vector similarity measure
       */
      public class DistributedEuclideanDistanceVectorSimilarity extends AbstractDistributedVectorSimilarity {

        @Override
        protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
            double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {

          double n = 0.0;
          double sumXYdiff2 = 0.0;

          for (Cooccurrence cooccurrence : cooccurrences) {
            double diff = cooccurrence.getValueA() - cooccurrence.getValueB();
            sumXYdiff2 += diff * diff;
            n++;
          }

          return n / (1.0 + Math.sqrt(sumXYdiff2));
        }

      }

      For boolean data every difference is zero, so this one always returns n (the number of cooccurrences).
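      A minimal standalone sketch of that observation, not Mahout code: each row {valueA, valueB} is a hypothetical stand-in for one Cooccurrence, and BooleanEuclideanSketch and booleanSimilarity are names invented here for illustration.

```java
// Standalone sketch, not Mahout code: each row {valueA, valueB} stands in
// for one Cooccurrence (hypothetical representation).
final class BooleanEuclideanSketch {

  // Same arithmetic as doComputeResult in the issue, on plain arrays.
  static double similarity(double[][] pairs) {
    double n = 0.0;
    double sumXYdiff2 = 0.0;
    for (double[] pair : pairs) {
      double diff = pair[0] - pair[1];
      sumXYdiff2 += diff * diff;
      n++;
    }
    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }

  // Possible shortcut for boolean data: every diff is 0, so only the count matters.
  static double booleanSimilarity(double[][] pairs) {
    return pairs.length;
  }

  public static void main(String[] args) {
    double[][] pairs = {{1.0, 1.0}, {1.0, 1.0}, {1.0, 1.0}};
    System.out.println(similarity(pairs));
    System.out.println(booleanSimilarity(pairs));
  }
}
```

      With three boolean cooccurrences both methods give 3.0, since the square-root term vanishes.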
      2) DistributedUncenteredCosineVectorSimilarity
      /**
       * distributed implementation of cosine similarity that does not center its data
       */
      public class DistributedUncenteredCosineVectorSimilarity extends AbstractDistributedVectorSimilarity {

        @Override
        protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
            double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {

          int n = 0;
          double sumXY = 0.0;
          double sumX2 = 0.0;
          double sumY2 = 0.0;

          for (Cooccurrence cooccurrence : cooccurrences) {
            double x = cooccurrence.getValueA();
            double y = cooccurrence.getValueB();
            sumXY += x * y;
            sumX2 += x * x;
            sumY2 += y * y;
            n++;
          }

          if (n == 0) {
            return Double.NaN;
          }

          double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
          if (denominator == 0.0) {
            // One or both vectors has -all- the same values;
            // can't really say much similarity under this measure
            return Double.NaN;
          }

          return sumXY / denominator;
        }

      }

      For boolean data sumXY, sumX2 and sumY2 all equal n, so this one always returns 1.0 whenever there is at least one cooccurrence.
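      The same point as a standalone sketch, again using hypothetical {valueA, valueB} rows in place of Cooccurrence objects; BooleanCosineSketch and booleanSimilarity are illustrative names, not part of Mahout. Note that the general arithmetic only equals 1.0 up to floating-point rounding, because Math.sqrt(n) * Math.sqrt(n) is not always exactly n.

```java
// Standalone sketch, not Mahout code: each row {valueA, valueB} stands in
// for one Cooccurrence (hypothetical representation).
final class BooleanCosineSketch {

  // Same arithmetic as the uncentered cosine in the issue, on plain arrays.
  static double similarity(double[][] pairs) {
    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (double[] pair : pairs) {
      double x = pair[0];
      double y = pair[1];
      sumXY += x * y;
      sumX2 += x * x;
      sumY2 += y * y;
      n++;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      return Double.NaN;
    }
    return sumXY / denominator;
  }

  // Possible shortcut for boolean data: no sums or square roots needed.
  static double booleanSimilarity(int cooccurrenceCount) {
    return cooccurrenceCount == 0 ? Double.NaN : 1.0;
  }
}
```

      For any non-empty list of boolean cooccurrences the two methods agree up to rounding, which is why this measure carries no information on boolean data.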
      3) DistributedUncenteredZeroAssumingCosineVectorSimilarity
      If n users like ItemA, m users like ItemB, and p users like both ItemA and ItemB,

      DistributedUncenteredZeroAssumingCosineVectorSimilarity returns p / sqrt(m*n).

      It can also be used for boolean data, but we could provide a simpler one that returns (p*p)/(m*n), which avoids the square roots and so needs less computation.
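      A small sketch of the two formulas, assuming the standard cosine definition for boolean vectors, p / (sqrt(n) * sqrt(m)), and taking n, m and p as precomputed counts with n > 0 and m > 0. The class and method names are invented for illustration and are not Mahout API.

```java
// Standalone sketch, not Mahout code. n = users who like ItemA,
// m = users who like ItemB, p = users who like both (n > 0 and m > 0 assumed).
final class BooleanZeroAssumingCosineSketch {

  // Standard cosine between two boolean vectors, where missing entries count as 0.
  static double cosine(long n, long m, long p) {
    return p / (Math.sqrt(n) * Math.sqrt(m));
  }

  // Proposed simpler variant: the squared cosine, with no square roots.
  static double squaredCosine(long n, long m, long p) {
    return ((double) p * p) / ((double) m * n);
  }

  public static void main(String[] args) {
    System.out.println(cosine(4, 9, 3));
    System.out.println(squaredCosine(4, 9, 3));
  }
}
```

      Because squaring is monotonic on non-negative values, ranking items by the squared form gives the same order as ranking by the cosine itself; only the absolute scores differ.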

          People

            Assignee: srowen Sean R. Owen
            Reporter: huiwenhan Han Hui Wen