  1. Mahout
  2. MAHOUT-937

Collocations Job Partitioner not being configured properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

      Description

      The first pass of the collocations discovery job (as described by CollocDriver.generateCollocations) uses the org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner.

      This partitioner has an instance variable, offset, that is supposed to be set by a call to setOffsets(), but this call is never made. (I'm not sure why; is this method expected to be called by the Hadoop framework itself?)

      Because the offset is never set, getPartition always returns 0, so all intermediate data is sent to a single reducer.

      I couldn't quite understand what this partitioning was meant to be doing, but simply hashing the Gram's primary string representation (i.e. without the leading 'type' byte) does what is required:

      package org.apache.mahout.vectorizer.collocations.llr;

      import org.apache.hadoop.io.WritableComparator;
      import org.apache.hadoop.mapreduce.Partitioner;

      public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {

        @Override
        public int getPartition(GramKey key, Gram value, int numPartitions) {
          // Exclude the first byte, which is the key type.
          byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength() - 1];
          System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length);
          int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
          // Mask off the sign bit so the modulo result is never negative.
          return (hash & Integer.MAX_VALUE) % numPartitions;
        }

      }
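
To see the effect of skipping the type byte without a Hadoop cluster, here is a rough standalone sketch. Everything in it is hypothetical and for illustration only: the class GramPartitionSketch, the helpers key() and partitionFor(), and the local hashBytes(), which is assumed to behave like Hadoop's WritableComparator.hashBytes (a 31-based rolling hash starting from 1). It is not the Mahout code itself.

```java
import java.nio.charset.StandardCharsets;

final class GramPartitionSketch {

  // Assumed stand-in for WritableComparator.hashBytes: 31-based rolling hash.
  static int hashBytes(byte[] bytes, int offset, int length) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++) {
      hash = 31 * hash + bytes[i];
    }
    return hash;
  }

  // Hypothetical helper: hash the primary bytes after the leading type byte,
  // then mask the sign bit so the modulo result is never negative.
  static int partitionFor(byte[] keyBytes, int primaryLength, int numPartitions) {
    int hash = hashBytes(keyBytes, 1, primaryLength - 1);
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  // Hypothetical helper: build key bytes as one type byte followed by the gram text.
  static byte[] key(char type, String gram) {
    byte[] text = gram.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[text.length + 1];
    out[0] = (byte) type;
    System.arraycopy(text, 0, out, 1, text.length);
    return out;
  }

  public static void main(String[] args) {
    int numPartitions = 4;
    String[] grams = {"new", "york", "new york", "san francisco"};
    for (String gram : grams) {
      // The type byte values 'h' and 't' are made up for this sketch.
      byte[] a = key('h', gram);
      byte[] b = key('t', gram);
      System.out.println(gram + " -> " + partitionFor(a, a.length, numPartitions)
          + " / " + partitionFor(b, b.length, numPartitions));
    }
  }
}
```

Each line prints the same partition number twice, since keys that differ only in the type byte hash identically, while different grams are free to land on different reducers.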
      

        Attachments

        1. MAHOUT-937.patch
          2 kB
          Sean R. Owen
        2. GramKeyPartitioner.java
          2 kB
          Mat Kelcey

            People

            • Assignee:
              srowen Sean R. Owen
              Reporter:
              mat_kelcey Mat Kelcey