Mahout
  1. Mahout
  2. MAHOUT-738

Collocation driver has long being statically cast to an int

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Math
    • Labels:
      None

      Description

      org.apache.mahout.vectorizer.collocations.llr.LLRReducer, which is part of the collocation driver, statically casts a long to an int.

      private long ngramTotal;
      ...
      int k11 = ngram.getFrequency(); /* a&b */
      int k12 = gramFreq[0] - ngram.getFrequency(); /* a&!b */
      int k21 = gramFreq[1] - ngram.getFrequency(); /* !b&a */
      int k22 = (int) (ngramTotal - (gramFreq[0] + gramFreq[1] - ngram.getFrequency())); /* !a&!b */

      These numbers are then fed into

      org.apache.mahout.math.stats.LogLikelihood

      specifically the function below.

      public static double logLikelihoodRatio(int k11, int k12, int k21, int k22) {
      // note that we have counts here, not probabilities, and that the entropy is not normalized.
      double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
      double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      if (rowEntropy + columnEntropy > matrixEntropy)

      { // round off error return 0.0; }

      return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
      }

      In short if the long ngramTotal is larger than Integer.MAX_VALUE (which will happen in large datasets), then the driver will either crash or in the case that it casts to a negative int, will continue as usual but produce no output due to error checking.

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        17h 40m 1 Sean Owen 22/Jun/11 09:49
        Patch Available Patch Available Resolved Resolved
        8h 15m 1 Sean Owen 22/Jun/11 18:05
        Resolved Resolved Closed Closed
        231d 20h 54m 1 Sean Owen 09/Feb/12 14:00
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #897 (See https://builds.apache.org/job/Mahout-Quality/897/)
        MAHOUT-738 treat input to LLR as long to avoid possible overflow

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138553
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredZeroAssumingCosineVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedPearsonCorrelationVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedLoglikelihoodVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedEuclideanDistanceVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedTanimotoCoefficientVectorSimilarity.java
        • /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredCosineVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCityBlockVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedVectorSimilarity.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducerTest.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/AbstractDistributedVectorSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCooccurrenceVectorSimilarity.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/similarity/TestRowSimilarityJob.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #897 (See https://builds.apache.org/job/Mahout-Quality/897/ ) MAHOUT-738 treat input to LLR as long to avoid possible overflow srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138553 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredZeroAssumingCosineVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedPearsonCorrelationVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedLoglikelihoodVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducer.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedEuclideanDistanceVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedTanimotoCoefficientVectorSimilarity.java /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredCosineVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCityBlockVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedVectorSimilarity.java /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducerTest.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/AbstractDistributedVectorSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCooccurrenceVectorSimilarity.java /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/similarity/TestRowSimilarityJob.java
        Sean Owen made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Sean Owen made changes -
        Attachment MAHOUT-739.patch [ 12483417 ]
        Sean Owen made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Assignee Sean Owen [ srowen ]
        Fix Version/s 0.6 [ 12316364 ]
        Hide
        peter andrews added a comment -

        Pretty much what I was thinking.

        Show
        peter andrews added a comment - Pretty much what I was thinking.
        Hide
        Sean Owen added a comment -

        How about we just make this method operate on longs? seems like the easiest and most straightforward thing.

        Show
        Sean Owen added a comment - How about we just make this method operate on longs? seems like the easiest and most straightforward thing.
        peter andrews created issue -

          People

          • Assignee:
            Sean Owen
            Reporter:
            peter andrews
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development