MAHOUT-749

MeanShift Cannot Use Multiple Reducers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels:
      None

      Description

      The MeanShiftCanopy clustering job sets numReducers=1, which severely limits its scalability for larger jobs.

      1. MAHOUT-749.patch
        38 kB
        Jeff Eastman

        Activity

        Jeff Eastman added a comment -

        This patch changes the driver and mapper to use multiple reducers. The driver now decreases the number of reducers in each iteration until it reaches 1. The mapper now sends each of its outputs to a different reducer, depending upon the number deployed in the iteration. The unit tests have been modified and they run. This is ready for some experimentation with larger datasets and multiple reducers specified by -Dmapred.reduce.tasks.
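
The shrinking reducer count and the spreading of mapper outputs can be sketched roughly as follows. This is a minimal illustration and not the code in MAHOUT-749.patch: the class and method names are hypothetical, halving the count each iteration is an assumption (the comment only says the count decreases to 1), and round-robin assignment of outputs is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not taken from MAHOUT-749.patch. */
final class ReducerSchedule {

  private ReducerSchedule() {}

  /**
   * Reducer counts to use per iteration, starting from the requested count.
   * Assumption: the count is halved each iteration; the patch may shrink it differently.
   */
  static List<Integer> schedule(int requestedReducers) {
    List<Integer> counts = new ArrayList<>();
    int n = Math.max(1, requestedReducers);
    while (n > 1) {
      counts.add(n);
      n = Math.max(1, n / 2);
    }
    counts.add(1); // the last entry is always a single reducer
    return counts;
  }

  /**
   * Reducer index for the i-th output of a mapper.
   * Assumption: outputs are spread round-robin over the reducers of the current iteration.
   */
  static int reducerFor(int outputIndex, int numReducers) {
    return Math.floorMod(outputIndex, numReducers);
  }

  public static void main(String[] args) {
    System.out.println("reducers per iteration: " + schedule(8));
    for (int i = 0; i < 6; i++) {
      System.out.println("output " + i + " -> reducer " + reducerFor(i, 4));
    }
  }
}
```

Ending with a single reducer means the last pass sees every remaining canopy together, which is presumably why the driver shrinks the count rather than keeping it fixed.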

        Elmer Garduno added a comment -

        Hi Jeff,

        The CanopyDriver has the same problem: it also sets numReducers=1. Do you think this kind of solution could also fix Canopy's scalability issues?

        Hudson added a comment -

        Integrated in Mahout-Quality #953 (See https://builds.apache.org/job/Mahout-Quality/953/)
        MAHOUT-749: Implemented the multiple-reducer approach from the Jira patch, plus a scalability enhancement to avoid accumulating merged clusterIds if the -cl option is not present. The defaults give the same behavior as before. All tests run, though this needs more testing to see how it really scales.

        jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1149369
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyClusterer.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayMeanShift.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyConfigKeys.java
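
The idea of skipping the merged clusterIds unless they are needed might look something like the sketch below. Everything here is hypothetical: the class, its fields, and the `trackMergedIds` flag are illustrative names, and the flag only stands in for "the -cl option is present"; the committed MeanShiftCanopy code may be organized quite differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the committed MeanShiftCanopy implementation. */
final class MergingCanopy {

  private final boolean trackMergedIds; // stand-in for "-cl option is present"
  private final List<Integer> mergedIds = new ArrayList<>();
  private int numPoints = 1;

  MergingCanopy(int id, boolean trackMergedIds) {
    this.trackMergedIds = trackMergedIds;
    if (trackMergedIds) {
      mergedIds.add(id);
    }
  }

  /** Absorbs another canopy; ids are accumulated only when tracking is enabled. */
  void merge(MergingCanopy other) {
    numPoints += other.numPoints;
    if (trackMergedIds) {
      mergedIds.addAll(other.mergedIds);
    }
  }

  int getNumPoints() {
    return numPoints;
  }

  List<Integer> getMergedIds() {
    return mergedIds;
  }

  public static void main(String[] args) {
    MergingCanopy tracked = new MergingCanopy(0, true);
    tracked.merge(new MergingCanopy(1, true));
    MergingCanopy untracked = new MergingCanopy(0, false);
    untracked.merge(new MergingCanopy(1, false));
    System.out.println("tracked ids: " + tracked.getMergedIds());
    System.out.println("untracked ids: " + untracked.getMergedIds());
  }
}
```

With tracking off, the id list stays empty however many canopies are merged, so memory per canopy no longer grows with the number of merges; only the point count is kept.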
        Sean Owen added a comment -

        Jeff, am I right that this is done? I'm marking it for 0.6 in any event, since it looks like you're well into this.


          People

          • Assignee: Jeff Eastman
          • Reporter: Jeff Eastman
          • Votes: 0
          • Watchers: 0
