Mahout
  1. Mahout
  2. MAHOUT-937

Collocations Job Partitioner not being configured properly

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels:
      None

      Description

      The first pass of the collocations discovery job (as described by CollocDriver.generateCollocations) uses the org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner.

      This partitoner has an instance variable offset that is supposed to be set by a call to setOffsets() but this call is never made (not sure why? is this method expected to be called by the Hadoop framework itself?)

      The offset not being set results in getPartition always returning 0 and so all intermediate data is sent to the one reducer.

      I couldn't quite understand what this partitioning was meant to be doing, but simply hashing the Grams primary string representation (ie without the leading 'type' byte) does what is required...

      public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {
      
        @Override
        public int getPartition(GramKey key, Gram value, int numPartitions) {
          // exclude first byte which is the key type 
          byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength()-1]; 
          System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length); 
          int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
          return (hash & Integer.MAX_VALUE) % numPartitions;    
        }
        
      }
      
      1. MAHOUT-937.patch
        2 kB
        Sean Owen
      2. GramKeyPartitioner.java
        2 kB
        Mat Kelcey

        Activity

        Mat Kelcey created issue -
        Mat Kelcey made changes -
        Field Original Value New Value
        Attachment GramKeyPartitioner.java [ 12508729 ]
        Mat Kelcey made changes -
        Attachment GramKeyPartitioner.java [ 12508729 ]
        Mat Kelcey made changes -
        Attachment GramKeyPartitioner.java [ 12508730 ]
        Mat Kelcey made changes -
        Description The first pass of the collocations discovery job (as described by CollocDriver.generateCollocations) uses the org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner.

        This partitoner has an instance variable offset that is supposed to be set by a call to setOffsets() but this call is never made (not sure why? is this method expected to be called by the Hadoop framework itself?)

        The offset not being set results in getPartition always returning 0 and so all intermediate data is sent to the one reducer.

        I couldn't quite understand what this partitioning was meant to be doing, but simply hashing the Grams primary string representation (ie without the leading 'type' byte) does what is required...

        public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {

          @Override
          public int getPartition(GramKey key, Gram value, int numPartitions) {
            // exclude first byte which is the key type
            byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength()-1];
            System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length);
            int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
            return (hash & Integer.MAX_VALUE) % numPartitions;
          }
          
        }


        The first pass of the collocations discovery job (as described by CollocDriver.generateCollocations) uses the org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner.

        This partitoner has an instance variable offset that is supposed to be set by a call to setOffsets() but this call is never made (not sure why? is this method expected to be called by the Hadoop framework itself?)

        The offset not being set results in getPartition always returning 0 and so all intermediate data is sent to the one reducer.

        I couldn't quite understand what this partitioning was meant to be doing, but simply hashing the Grams primary string representation (ie without the leading 'type' byte) does what is required...

        {code}
        public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {

          @Override
          public int getPartition(GramKey key, Gram value, int numPartitions) {
            // exclude first byte which is the key type
            byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength()-1];
            System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length);
            int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
            return (hash & Integer.MAX_VALUE) % numPartitions;
          }
          
        }
        {code}


        Hide
        Sean Owen added a comment -

        I agree this looks like it can't be right. The methods aren't called, and the result hashes 0 bytes every time. I think the simplest thing is to even avoid copying the byte array; here's my proposed patch, which has the same result. Tests pass. I'll commit soon if there are no objections since it seems like this can only be a fix, based on the semantics the tests imply.

        Show
        Sean Owen added a comment - I agree this looks like it can't be right. The methods aren't called, and the result hashes 0 bytes every time. I think the simplest thing is to even avoid copying the byte array; here's my proposed patch, which has the same result. Tests pass. I'll commit soon if there are no objections since it seems like this can only be a fix, based on the semantics the tests imply.
        Sean Owen made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.5 [ 12315255 ]
        Assignee Sean Owen [ srowen ]
        Fix Version/s 0.6 [ 12316364 ]
        Sean Owen made changes -
        Attachment MAHOUT-937.patch [ 12508768 ]
        Hide
        Mat Kelcey added a comment -

        I should have checked WritableComparator.hashBytes first, for some reason I had it in my head that the hashing was special. Thanks!

        Show
        Mat Kelcey added a comment - I should have checked WritableComparator.hashBytes first, for some reason I had it in my head that the hashing was special. Thanks!
        Sean Owen made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1278 (See https://builds.apache.org/job/Mahout-Quality/1278/)
        MAHOUT-937 make partitioner send to different reducers (as intended it seems) by just using the hash of primary bytes

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225420
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/GramKeyPartitioner.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1278 (See https://builds.apache.org/job/Mahout-Quality/1278/ ) MAHOUT-937 make partitioner send to different reducers (as intended it seems) by just using the hash of primary bytes srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225420 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/GramKeyPartitioner.java
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Sean Owen
            Reporter:
            Mat Kelcey
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development