Mahout
  1. Mahout
  2. MAHOUT-738

Collocation driver has long being statically cast to an int

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Math
    • Labels:
      None

      Description

      org.apache.mahout.vectorizer.collocations.llr.LLRReducer, which is part of the collocation driver, statically casts a long to an int.

      private long ngramTotal;
      ...
      int k11 = ngram.getFrequency(); /* a&b */
      int k12 = gramFreq[0] - ngram.getFrequency(); /* a&!b */
      int k21 = gramFreq[1] - ngram.getFrequency(); /* !b&a */
      int k22 = (int) (ngramTotal - (gramFreq[0] + gramFreq[1] - ngram.getFrequency())); /* !a&!b */

      These numbers are then fed into

      org.apache.mahout.math.stats.LogLikelihood

      specifically the function below.

      public static double logLikelihoodRatio(int k11, int k12, int k21, int k22) {
      // note that we have counts here, not probabilities, and that the entropy is not normalized.
      double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
      double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      if (rowEntropy + columnEntropy > matrixEntropy)

      { // round off error return 0.0; }

      return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
      }

      In short if the long ngramTotal is larger than Integer.MAX_VALUE (which will happen in large datasets), then the driver will either crash or in the case that it casts to a negative int, will continue as usual but produce no output due to error checking.

        Activity

        peter andrews created issue -
        Hide
        Sean Owen added a comment -

        How about we just make this method operate on longs? seems like the easiest and most straightforward thing.

        Show
        Sean Owen added a comment - How about we just make this method operate on longs? seems like the easiest and most straightforward thing.
        Hide
        peter andrews added a comment -

        Pretty much what I was thinking.

        Show
        peter andrews added a comment - Pretty much what I was thinking.
        Sean Owen made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Assignee Sean Owen [ srowen ]
        Fix Version/s 0.6 [ 12316364 ]
        Sean Owen made changes -
        Attachment MAHOUT-739.patch [ 12483417 ]
        Sean Owen made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #897 (See https://builds.apache.org/job/Mahout-Quality/897/)
        MAHOUT-738 treat input to LLR as long to avoid possible overflow

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138553
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredZeroAssumingCosineVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedPearsonCorrelationVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedLoglikelihoodVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedEuclideanDistanceVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedTanimotoCoefficientVectorSimilarity.java
        • /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredCosineVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCityBlockVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedVectorSimilarity.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducerTest.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/AbstractDistributedVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCooccurrenceVectorSimilarity.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/similarity/TestRowSimilarityJob.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #897 (See https://builds.apache.org/job/Mahout-Quality/897/ ) MAHOUT-738 treat input to LLR as long to avoid possible overflow srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138553 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredZeroAssumingCosineVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedPearsonCorrelationVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedLoglikelihoodVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducer.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedEuclideanDistanceVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedTanimotoCoefficientVectorSimilarity.java /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredCosineVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCityBlockVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedVectorSimilarity.java /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducerTest.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/AbstractDistributedVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCooccurrenceVectorSimilarity.java /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/similarity/TestRowSimilarityJob.java
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        17h 40m 1 Sean Owen 22/Jun/11 09:49
        Patch Available Patch Available Resolved Resolved
        8h 15m 1 Sean Owen 22/Jun/11 18:05
        Resolved Resolved Closed Closed
        231d 20h 54m 1 Sean Owen 09/Feb/12 14:00

          People

          • Assignee:
            Sean Owen
            Reporter:
            peter andrews
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development