Mahout / MAHOUT-3

Build initial canopy clustering prototype

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1
    • Component/s: Clustering
    • Labels:
      None

      Description

      I'd like to reserve some namespace, specifically org.apache.mahout.clustering.canopy to use for an initial prototype of canopy clustering. I'm going to start with a little unit test to get the basic algorithm sorted out, then M/R it.

      1. MAHOUT-3.diff
        13 kB
        Jeff Eastman
      2. MAHOUT-3a.diff
        40 kB
        Jeff Eastman
      3. MAHOUT-3b.diff
        46 kB
        Jeff Eastman
      4. MAHOUT-3c.diff
        64 kB
        Jeff Eastman
      5. MAHOUT-3d.diff
        67 kB
        Jeff Eastman
      6. MAHOUT-3e.diff
        68 kB
        Jeff Eastman
      7. MAHOUT-3f.diff
        68 kB
        Jeff Eastman
      8. MAHOUT-3g.patch
        77 kB
        Grant Ingersoll

        Activity

        Jeff Eastman added a comment -

        Here's an initial patch which introduces a couple of unit tests that implement a very basic canopy cluster, two distance measures and a Canopy class. It is not M/R ready and is just the beginning.

        • src/main/java/org/apache/mahout/clustering/canopy/Canopy.java: new class
          (constructor): create a new Canopy with the given point
          (add): add another point to the Canopy
          (toString): return a printable representation
          (ptOut): return a printable representation of the point
        • src/main/java/org/apache/mahout/clustering/canopy/DistanceMeasure.java: new interface
          (distance): single method returns the distance metric
        • src/main/java/org/apache/mahout/clustering/canopy/ManhattanDistancemeasure.java: new class
          (distance): single method returns the Manhattan distance metric
        • src/main/java/org/apache/mahout/clustering/canopy/EuclidianDistanceMeasure.java: new class
          (distance): single method returns the Euclidean distance metric
        • src/test/java/org/apache/mahout/clustering/canopy/TestCanopy.java
          (testOne, testTwo): new unit tests
          (getPoints, makeCanopy, prtCanopies): utilities
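For readers who want the two metrics spelled out, here is a minimal sketch of what a distance-measure interface with Manhattan and Euclidean implementations might look like. The patch's real signatures and point representation are not shown in this thread, so the `double[]` point type and the names `DistanceSketch`, `Manhattan` and `Euclidean` are assumptions made for illustration.

```java
// Illustrative sketch only: points are assumed to be double[] arrays of equal length.
class DistanceSketch {

    interface DistanceMeasure {
        double distance(double[] p1, double[] p2);
    }

    // Manhattan (L1) distance: sum of absolute coordinate differences.
    static class Manhattan implements DistanceMeasure {
        @Override
        public double distance(double[] p1, double[] p2) {
            double sum = 0.0;
            for (int i = 0; i < p1.length; i++) {
                sum += Math.abs(p1[i] - p2[i]);
            }
            return sum;
        }
    }

    // Euclidean (L2) distance: square root of the sum of squared differences.
    static class Euclidean implements DistanceMeasure {
        @Override
        public double distance(double[] p1, double[] p2) {
            double sum = 0.0;
            for (int i = 0; i < p1.length; i++) {
                double diff = p1[i] - p2[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }
    }

    public static void main(String[] args) {
        double[] a = {0.0, 0.0};
        double[] b = {3.0, 4.0};
        System.out.println(new Manhattan().distance(a, b)); // prints 7.0
        System.out.println(new Euclidean().distance(a, b)); // prints 5.0
    }
}
```

Keeping the metric behind an interface lets the clustering code stay unchanged when the measure is swapped.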
        Jeff Eastman added a comment -

        Initial implementation of Canopy generation phase of two-phase Canopy
        Clustering algorithm. See unit tests for the evolution of the user
        stories leading to the working implementation.

        TODO: Implement the actual clustering of the original points using
        the canopy centers produced by this implementation.

        TODO: Sort out the generics

        TODO: Allow points to be sparse, to carry payloads for use by other
        subsystems, ...

        All unit tests run.

        • src/main/java/org/apache/mahout/clustering/canopy
        • Canopy.java
          (addPointToCanopies): applies the distance metric to all canopies,
          adding the point to those that are covered
          (getCentroid): returns the initial centroid
          (getNumPoints): returns the number of points added
          (computeCentroid): normalizes the pointTotals with the numPoints
          to return a computed centroid for the canopy
          (ptOut, toString, formatPoint): utilities
        • CanopyDriver.java
          (main): the main program
          (runJob): static used by unit tests
        • CanopyMapper.java
          (map): the map function assigns points to canopies
          (config): configuration provided for unit tests
          (configure): reads distance measure and threshold from job
          (close): writes the canopy centroids to the output
        • CanopyReducer.java
          (reduce): the reduce function assigns points to canopies
          (config): configuration provided for unit tests
          (configure): reads distance measure and threshold from job
          (close): writes the canopy centroids to the output
        • DistanceMeasure.java
          (distance): compute the distance between two points by some measure
        • EuclideanDistanceMeasure.java
          (distance): compute the distance between two points by Euclidean measure
        • ManhattanDistanceMeasure.java
          (distance): compute the distance between two points by Manhattan measure
        • src/test/java/org/apache/mahout/clustering/canopy
        • DummyOutputCollector.java
          (collect): collects output data
          (getData): returns output data for unit tests
        • TestCanopy.java
          (addPoint): overrides Canopy method to add point to a list
          (toString): overrides Canopy method to add point printout
        • TestCanopyCreation.java
          (setUp): uses published algorithm to initialize reference data
          (testReferenceManhattan, testReferenceEuclidean): validates reference data
          (testIterativeManhattan, testIterativeEuclidean): uses optimized
          algorithm and verifies result vs. reference data
          (testCanopyMapperManhattan, testCanopyMapperEuclidean,
          testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
          mapper and reducer with test data
          (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
          resulting canopy centroids
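The centroid bookkeeping described for computeCentroid (per-dimension totals divided by the number of points) can be sketched as follows. `CentroidAccumulator` is a hypothetical stand-in, not the patch's Canopy class, and it assumes points are `double[]` arrays of a fixed dimension.

```java
import java.util.Arrays;

// Hypothetical stand-in for the pointTotals / numPoints bookkeeping; not the patch's code.
class CentroidAccumulator {
    private final double[] pointTotals; // running per-dimension sums
    private int numPoints;              // number of points added so far

    CentroidAccumulator(int dimensions) {
        pointTotals = new double[dimensions];
    }

    // Add one point to the running totals and bump the count.
    void addPoint(double[] point) {
        for (int i = 0; i < pointTotals.length; i++) {
            pointTotals[i] += point[i];
        }
        numPoints++;
    }

    int getNumPoints() {
        return numPoints;
    }

    // Divide each per-dimension total by the point count; needs at least one point.
    double[] computeCentroid() {
        if (numPoints == 0) {
            throw new IllegalStateException("no points added");
        }
        double[] centroid = new double[pointTotals.length];
        for (int i = 0; i < pointTotals.length; i++) {
            centroid[i] = pointTotals[i] / numPoints;
        }
        return centroid;
    }

    public static void main(String[] args) {
        CentroidAccumulator acc = new CentroidAccumulator(2);
        acc.addPoint(new double[] {1.0, 1.0});
        acc.addPoint(new double[] {3.0, 5.0});
        System.out.println(Arrays.toString(acc.computeCentroid())); // prints [2.0, 3.0]
    }
}
```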
        Grant Ingersoll added a comment -

        One minor nit for your next patch: please run the diff in the basedir https://svn.apache.org/repos/asf/lucene/mahout/trunk, as I don't have a "mahout" directory due to having several versions checked out.

        Grant Ingersoll added a comment -

        Couple of minor comments:

        I updated the build.xml to have a test target. You might want to rename the TestCanopy helper test b/c it causes the Test** and/or **Test includes in ANT to think it is a test case.

        Really starting to come together. Looks like you have a lot of good tests and good documentation. I haven't gotten into the real substance yet, but this is definitely promising.

        Jeff Eastman added a comment - edited

        Improved implementation of Canopy generation phase of two-phase Canopy
        Clustering algorithm. See unit tests for the evolution of the user
        stories leading to the working implementation.

        This implementation incorporates Ted Dunning's comments on my original approach.
        In particular, it does not rely upon emitting data during the close() operation.
        During the map phase, subsets of the input points are assigned to canopies
        by each mapper and output to a combiner which then computes and outputs the
        canopy centroids for each subset. During the reduce phase, the centroids are
        again clustered into a final set of canopies which are output.
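The data flow in the paragraph above can be simulated in plain Java without Hadoop. This is a rough sketch under several assumptions: points are `double[]` arrays, the thresholds `T1` and `T2` and the Manhattan metric are picked only for the demo, each inner list stands in for one mapper's input split, and `TwoStageCanopies` and its methods are hypothetical names rather than the patch's classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TwoStageCanopies {
    // Threshold values are assumptions chosen for the demo, with T1 > T2.
    static final double T1 = 4.0;
    static final double T2 = 2.5;

    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    // Single pass over the points: a point within T1 of a canopy center is added to
    // that canopy's totals; a point within T2 of no center starts a new canopy.
    // Returns the centroid (totals / count) of every canopy formed.
    static List<double[]> canopyCentroids(List<double[]> points) {
        List<double[]> centers = new ArrayList<>();
        List<double[]> totals = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (int c = 0; c < centers.size(); c++) {
                double d = manhattan(centers.get(c), p);
                if (d < T1) {
                    double[] t = totals.get(c);
                    for (int i = 0; i < t.length; i++) t[i] += p[i];
                    counts.set(c, counts.get(c) + 1);
                }
                if (d < T2) stronglyBound = true;
            }
            if (!stronglyBound) {
                centers.add(p);
                totals.add(p.clone());
                counts.add(1);
            }
        }
        List<double[]> centroids = new ArrayList<>();
        for (int c = 0; c < totals.size(); c++) {
            double[] t = totals.get(c);
            double[] centroid = new double[t.length];
            for (int i = 0; i < t.length; i++) centroid[i] = t[i] / counts.get(c);
            centroids.add(centroid);
        }
        return centroids;
    }

    // "Map + combine" per input split, then one "reduce" over all partial centroids.
    static List<double[]> run(List<List<double[]>> splits) {
        List<double[]> partial = new ArrayList<>();
        for (List<double[]> split : splits) partial.addAll(canopyCentroids(split));
        return canopyCentroids(partial);
    }

    public static void main(String[] args) {
        List<List<double[]>> splits = Arrays.asList(
            Arrays.asList(new double[] {0, 0}, new double[] {1, 0}, new double[] {10, 10}),
            Arrays.asList(new double[] {0, 1}, new double[] {10, 11}, new double[] {11, 10}));
        for (double[] c : run(splits)) System.out.println(Arrays.toString(c));
    }
}
```

Note that the final stage here averages the partial centroids without weighting them by how many points each one represents; whether the patch does the same is not stated in this thread.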

        This also incorporates Grant Ingersoll's comments on the name of the Canopy
        subclass (now VisibleCanopy vs. TestCanopy) and the .diff file is done from
        inside the project root.

        TODO: Implement the actual clustering of the original points using
        the canopy centers produced by this implementation.

        TODO: Sort out the generics

        TODO: Allow points to be sparse, to carry payloads for use by other
        subsystems, ...

        All unit tests run.

        • src/main/java/org/apache/mahout/clustering/canopy
        • Canopy.java
          (configure): sets the distance measure, t1 and t2 statics for subsequent
          operations. Assumes all canopies created by this class loader will
          have the same properties.
          (addPointToCanopies): applies the distance metric to all canopies,
          adding the point to those that are covered
          (emitPointToCanopies): same algorithm but used by mapper to output
          points with canopyIds to CanopyCombiner
          (addPoint): add a point to the pointTotals and bump numPoints
          (emitPoint): output the point to the collector thence to the combiner
          (getCenter): returns the canopy center
          (getNumPoints): returns the number of points in the canopy
          (getCanopyId): returns the canopyId
          (computeCentroid): normalizes the pointTotals with the numPoints
          to return a computed centroid for the canopy
          (formatPoint, decodePoint): encoding/decoding for points
          (formatCanopy, decodeCanopy): encoding/decoding for canopies
          (ptOut, toString): utilities
        • CanopyDriver.java
          (main): the main program
          (runJob): static used by unit tests
        • CanopyMapper.java
          (map): the map function assigns points to canopies outputting each
          point to each of its canopies
          (configure): reads distance measure and thresholds from job and
          configures Canopy.
        • CanopyCombiner.java
          (reduce): computes & writes the canopy centroids to the output using
          a single "centroid" key
          (configure): reads distance measure and thresholds from job and
          configures Canopy.
        • CanopyReducer.java
          (reduce): the reduce function assigns points to canopies
          (configure): reads distance measure and thresholds from job and
          configures Canopy.
        • DistanceMeasure.java
          (distance): compute the distance between two points by some measure
        • EuclideanDistanceMeasure.java
          (distance): compute the distance between two points by Euclidean measure
        • ManhattanDistanceMeasure.java
          (distance): compute the distance between two points by Manhattan measure
        • src/test/java/org/apache/mahout/clustering/canopy
        • DummyOutputCollector.java
          (collect): collects output data in a map
          (getData): returns output data for unit tests
          (getKeys): returns the key set
          (getValue): returns the value associated with the key
        • VisibleCanopy.java
          (addPoint): overrides Canopy method to add point to a list
          (toString): overrides Canopy method to add point printout
        • TestCanopyCreation.java
          (setUp): uses published algorithm to initialize reference data
          (testReferenceManhattan, testReferenceEuclidean): validates reference data
          (testIterativeManhattan, testIterativeEuclidean): uses optimized
          algorithm and verifies result vs. reference data
          (testCanopyMapperManhattan, testCanopyMapperEuclidean,
          testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
          mapper/combiner and reducer with test data
          (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
          resulting canopy centroids
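The notes above mention t1 and t2 thresholds and a "published algorithm" used to build reference data. The patch source is not shown here, so the following is only a sketch of the commonly described two-threshold canopy procedure: take a point as a canopy center, add every remaining point within T1 to that canopy, and drop from further consideration every point within T2. The class and method names, the `double[]` point type and the Euclidean metric are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReferenceCanopies {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns one list of member points per canopy; the first member is the center.
    // Expects t1 > t2. Points between t2 and t1 of a center may land in several canopies.
    static List<List<double[]>> build(List<double[]> input, double t1, double t2) {
        List<double[]> remaining = new ArrayList<>(input); // work on a copy
        List<List<double[]>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] q = it.next();
                double d = euclidean(center, q);
                if (d < t1) canopy.add(q);  // loosely covered: joins this canopy
                if (d < t2) it.remove();    // tightly covered: cannot seed a new canopy
            }
            canopies.add(canopy);
        }
        return canopies;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0.0}, new double[] {1.0}, new double[] {2.5}, new double[] {10.0});
        for (List<double[]> canopy : build(points, 3.0, 2.0)) {
            System.out.println("canopy with " + canopy.size() + " point(s)");
        }
    }
}
```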
        Jeff Eastman added a comment -

        A working implementation of a Canopy Clustering algorithm. See unit tests for
        the evolution of the user stories leading to the full implementation.

        This implementation incorporates Ted Dunning's comments on my original
        approach to canopy generation. In particular, it does not rely upon emitting data
        during the close() operation of the CanopyMapper or CanopyReducer.
        During the map phase, subsets of the input points are assigned to canopies
        by each mapper and output to a combiner which then computes and outputs the
        canopy centroids for each subset. During the reduce phase, the centroids are
        again clustered into a final set of canopies which are output.

        This patch also incorporates Grant Ingersoll's comments on the name of the
        Canopy subclass (now VisibleCanopy vs. TestCanopy) and the .diff file is done
        from inside the project root.

        NEW: This patch implements the actual clustering of the original points using
        the canopy centers produced by the cluster generation phase.

        TODO: Sort out the generics

        TODO: Allow the CanopyReducer to take different (e.g. smaller) threshold values
        so that canopy coalescing will not be so aggressive.

        TODO: Allow points to carry payloads for use by other subsystems, to be
        sparse, ...

        All unit tests run.

        • src/main/java/org/apache/mahout/clustering/canopy
        • Canopy.java
          (configure): sets the distance measure, t1 and t2 statics for subsequent
          operations. Assumes all canopies created by this class loader will
          have the same properties.
          (addPointToCanopies): applies the distance metric to all canopies,
          adding the point to those that are covered
          (emitPointToNewCanopies): same algorithm but used by CanopyMapper to
          output points with canopyIds to CanopyCombiner
          (emitPointToExistingCanopies): checks the distance and emits the point
          with each canopy definition as key. Emits the point to the closest
          canopy if canopy center clustering has moved the centroids so that
          the point is slightly outside of an existing canopy.
          (addPoint): add a point to the pointTotals and bump numPoints
          (emitPoint): output the point to the collector thence to the combiner
          (getCenter): returns the canopy center
          (getNumPoints): returns the number of points in the canopy
          (getCanopyId): returns the canopyId
          (computeCentroid): normalizes the pointTotals with the numPoints
          to return a computed centroid for the canopy
          (formatPoint, decodePoint): encoding/decoding for points
          (formatCanopy, decodeCanopy): encoding/decoding for canopies
          (covers): returns whether the point is covered by the canopy
          (ptOut, toString): utilities
        • CanopyDriver.java
          (main): the main program
          (runJob): static used by unit tests
        • CanopyMapper.java
          (map): the map function assigns points to canopies outputting each
          point to each of its canopies
          (configure): reads distance measure and thresholds from job and
          configures Canopy.
        • CanopyCombiner.java
          (reduce): computes & writes the canopy centroids to the output using
          a single "centroid" key
          (configure): reads distance measure and thresholds from job and
          configures Canopy.
        • CanopyReducer.java
          (reduce): the reduce function assigns points to canopies
          (configure): reads distance measure and thresholds from job and
          configures Canopy.
        • ClusterMapper.java
          (map): the map function assigns points to existing canopies outputting
          each point to each of its canopies
          (configure): reads distance measure and thresholds from job and
          configures Canopy. Also reads the canopy definitions produced by
          the CanopyReducer.
        • ClusterDriver.java
          (main): the main program uses IdentityReducers
          (runJob): static used by unit tests
        • Job.java
          (main): the main program invokes CanopyDriver and ClusterDriver
          (runJob): static used by unit tests
        • DistanceMeasure.java
          (distance): compute the distance between two points by some measure
        • EuclideanDistanceMeasure.java
          (distance): compute the distance between two points by Euclidean measure
        • ManhattanDistanceMeasure.java
          (distance): compute the distance between two points by Manhattan measure
        • src/test/java/org/apache/mahout/clustering/canopy
        • DummyOutputCollector.java
          (collect): collects output data in a map
          (getData): returns output data for unit tests
          (getKeys): returns the key set
          (getValue): returns the value associated with the key
        • VisibleCanopy.java
          (addPoint): overrides Canopy method to add point to a list
          (toString): overrides Canopy method to add point printout
        • TestCanopyCreation.java
          (setUp): uses published algorithm to initialize reference data
          (testReferenceManhattan, testReferenceEuclidean): validates reference data
          (testIterativeManhattan, testIterativeEuclidean): uses optimized
          algorithm and verifies result vs. reference data
          (testCanopyMapperManhattan, testCanopyMapperEuclidean,
          testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
          mapper/combiner and reducer with test data
          (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
          resulting canopy centroids
          (testClusterMapperManhattan, testClusterMapperEuclidean,
          testClusterReducerManhattan, testClusterReducerEuclidean): exercises
          mapper and reducer with test data, testing clustering correctness
          (testClusteringManhattanMR, testClusteringEuclideanMR): runs both
          canopy generation and clustering to print out results
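The rule described for emitPointToExistingCanopies (emit to every covering canopy, or to the closest one when none covers the point) can be sketched like this. It assumes "covered" means the distance to the canopy center is below T1 and that the Manhattan metric is in use; the class `ClusterAssignment` and its `assign` method are hypothetical and return canopy indices rather than emitting to a Hadoop collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClusterAssignment {

    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    // Returns the indices of all canopies whose center lies within t1 of the point.
    // If no canopy covers the point, falls back to the single closest canopy.
    static List<Integer> assign(double[] point, List<double[]> centers, double t1) {
        List<Integer> covering = new ArrayList<>();
        int closest = -1;
        double minDistance = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
            double d = manhattan(centers.get(i), point);
            if (d < t1) covering.add(i);
            if (d < minDistance) {
                minDistance = d;
                closest = i;
            }
        }
        if (covering.isEmpty() && closest >= 0) {
            covering.add(closest); // point fell just outside every canopy
        }
        return covering;
    }

    public static void main(String[] args) {
        List<double[]> centers = Arrays.asList(new double[] {0, 0}, new double[] {10, 10});
        System.out.println(assign(new double[] {1, 1}, centers, 3.0)); // prints [0]
        System.out.println(assign(new double[] {4, 4}, centers, 3.0)); // prints [0]
    }
}
```

The fallback keeps every input point in the output even when the reduce step has shifted the centroids away from it.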
        Grant Ingersoll added a comment -

        TODO: Allow points to carry payloads for use by other subsystems, to be
        sparse, ...

        Can you elaborate on what you have in mind?

        Jeff Eastman added a comment -

        Perhaps a better term would be 'attributes': Some application-specific string appended to the input line after the [...] which would be used by other applications for their own purposes. For example, if I'm clustering documents by their proximity in an n-d feature space, I might want their names to be passed through the clustering process.
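To make the idea concrete, here is a sketch of splitting such an input line into its point and its trailing attribute string. The exact line format is not specified in this thread, so this assumes a bracketed, comma-separated list of numbers followed by free text, for example `[1.0, 2.5] report.txt`; `PointLine` and `parse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one input line: the point plus an opaque payload string.
class PointLine {
    final double[] point;
    final String payload;

    private PointLine(double[] point, String payload) {
        this.point = point;
        this.payload = payload;
    }

    // Assumed format: "[v1, v2, ...] optional application-specific text".
    static PointLine parse(String line) {
        int open = line.indexOf('[');
        int close = line.indexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("no [...] point in line: " + line);
        }
        List<Double> values = new ArrayList<>();
        for (String token : line.substring(open + 1, close).split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {          // tolerate a trailing comma
                values.add(Double.parseDouble(trimmed));
            }
        }
        double[] point = new double[values.size()];
        for (int i = 0; i < point.length; i++) point[i] = values.get(i);
        // Everything after the closing bracket is carried along untouched.
        String payload = line.substring(close + 1).trim();
        return new PointLine(point, payload);
    }

    public static void main(String[] args) {
        PointLine parsed = PointLine.parse("[1.0, 2.5] report.txt");
        System.out.println(parsed.point.length + " dimensions, payload=" + parsed.payload);
    }
}
```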

        Jeff Eastman added a comment -

        This patch adds "payloads" to the previous patch, by passing the ClusterMapper input Writable intact through to the Canopy emit method so that any additional information beyond the point definition propagates through to the output. It is actually a bit more efficient to do it this way, since the point does not need to be reformatted upon collection. I've also added two unit tests thereof.

        I also added a space after the comma in the point formatting routines to make the output more human-readable.

        I've run this in a larger M/R job producing 50+ clusters from thousands of points having 25+ dimensions and it seems to be ready for broader use.

        Grant Ingersoll added a comment -

        I'm getting test errors when running ant test on the 3d.diff:

        ------------- Standard Error -----------------
        [junit] 08/02/16 09:14:53 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
        [junit] java.io.IOException: apache-Mahout-0.1-dev.jar: No such file or directory
        [junit] at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
        [junit] at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
        [junit] at org.apache.hadoop.fs.LocalFileSystem.copyFromLocalFile(LocalFileSystem.java:49)
        [junit] at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:796)
        [junit] at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:493)
        [junit] at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
        [junit] at org.apache.mahout.clustering.canopy.CanopyDriver.runJob(CanopyDriver.java:74)
        [junit] at org.apache.mahout.clustering.canopy.TestCanopyCreation.testCanopyGenManhattanMR(TestCanopyCreation.java:450)

        and:

        Testcase: testCanopyGenManhattanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation): Caused an ERROR
        [junit] output/canopies/part-00000
        [junit] java.io.FileNotFoundException: output/canopies/part-00000
        [junit] at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:142)
        [junit] at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:117)
        [junit] at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:274)
        [junit] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1356)
        [junit] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
        [junit] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
        [junit] at org.apache.mahout.clustering.canopy.TestCanopyCreation.testCanopyGenManhattanMR(TestCanopyCreation.java:458)
        [junit]
        [junit]
        [junit] Testcase: testCanopyGenEuclideanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation): Caused an ERROR
        [junit] output/canopies/part-00000
        [junit] java.io.FileNotFoundException: output/canopies/part-00000
        [junit] at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:142)
        [junit] at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:117)
        [junit] at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:274)
        [junit] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1356)
        [junit] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
        [junit] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
        [junit] at org.apache.mahout.clustering.canopy.TestCanopyCreation.testCanopyGenEuclideanMR(TestCanopyCreation.java:493)

        Jeff Eastman added a comment -

        This patch refactors the canopy configuration out of the configure methods of the various mappers, combiners and reducers into a single static configure method on Canopy. I changed DistanceMeasure creation from explicit tests to more generic instantiation by class and added a unit test for it. I also made DistanceMeasure extend JobConfigurable so that measures can be configured. Configurability will allow me to create a WeightedManhattanDistanceMeasure outside of the Mahout library. Making distance measures job-configurable seems to increase their versatility at no cost to the library.
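
        As a rough illustration of creating a distance measure from its class name rather than from hard-coded tests, here is a self-contained sketch. The Measure interface, the Manhattan and Euclidean classes and the load method are simplified stand-ins invented for this example; they are not the classes in the patch, and the Hadoop JobConfigurable part is left out so the code runs without Hadoop on the classpath.

```java
public class DistanceMeasureLoading {

    /** Simplified, hypothetical stand-in for a distance measure interface. */
    interface Measure {
        double distance(double[] a, double[] b);
    }

    /** Sum of absolute coordinate differences. */
    static class Manhattan implements Measure {
        public double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += Math.abs(a[i] - b[i]);
            }
            return sum;
        }
    }

    /** Square root of the sum of squared coordinate differences. */
    static class Euclidean implements Measure {
        public double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }
    }

    /** Instantiates any Measure implementation by class name via its no-arg constructor. */
    static Measure load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(Measure.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create measure " + className, e);
        }
    }

    public static void main(String[] args) {
        double[] p = {1.0, 2.0};
        double[] q = {4.0, 6.0};
        System.out.println(load(Manhattan.class.getName()).distance(p, q)); // 3 + 4 = 7.0
        System.out.println(load(Euclidean.class.getName()).distance(p, q)); // sqrt(9 + 16) = 5.0
    }
}
```

        Because the loader only needs a class name, a measure defined outside the library can be plugged in without changing the loading code, which is the versatility the comment above describes.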

        Finally, all unit tests still run. The build.xml test target works for me too.

        Grant Ingersoll added a comment -

        Hi Jeff,

        Now I am getting compile errors. I did "ant clean test" after applying the patch on an empty directory.

        compile-test:
        [mkdir] Created dir: /Volumes/User/grantingersoll/projects/lucene/mahout/mahout-clean/build/test-classes
        [javac] Compiling 3 source files to ..../projects/lucene/mahout/mahout-clean/build/test-classes
        [javac] ..../projects/lucene/mahout/mahout-clean/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java:780: cannot find symbol
        [javac] symbol : class UserDefinedDistanceMeasure
        [javac] location: class org.apache.mahout.clustering.canopy.TestCanopyCreation
        [javac] UserDefinedDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1);
        [javac] ^
        [javac] Note:..../lucene/mahout/mahout-clean/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java uses unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
        [javac] 1 error

        Grant Ingersoll added a comment -

        I think you forgot to svn add UserDefinedDistanceMeasure.

        Jeff Eastman added a comment -

        I forgot to svn add the UserDefinedDistanceMeasure before doing the last diff. This one has the class and runs all unit tests.

        Grant Ingersoll added a comment -

        Added some minor updates:

        • Added ASL headers on some files
        • The test target now requires the JAR to be built, since setJar is being called.
        • Added a parameter to pass in the JAR location, since dist-jar does not put the JAR in the working directory by default.

        It's a little weird to have the JAR created before the tests are run. Perhaps we should create something of an internal JAR in a tmp directory first for tests to use, then after the tests pass, we can create/copy the JAR to the official dist. area.

        Otherwise, looks good. I will plan on committing in the next day or two.

        Grant Ingersoll added a comment -

        Jeff,

        I noticed in ClusterDriver in the main() that the canopies argument (args[1]) is ignored. That seems a bit strange. Is this just a relic from an older way of doing it?

        public static void main(String[] args) {
          String points = args[0];
          //HERE
          String canopies = args[1];
          String output = args[2];
          String measureClassName = args[3];
          float t1 = new Float(args[4]);
          float t2 = new Float(args[5]);
          String jarLocation = "apache-mahout-0.1-dev.jar";
          if (args.length > 6) {
            jarLocation = args[6];
          }
          runJob(points, null, output, measureClassName, t1, t2, jarLocation);
        }

        I can just decrement them, if that is cool.
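
        One possible reading of "decrement them" is sketched below: drop the unused canopies argument and shift the remaining indices down by one. The ClusterDriverArgs class and its parse method are hypothetical names used only for this illustration, and the sketch does not call runJob; it is not necessarily what was committed.

```java
/** Hypothetical holder for the command-line arguments once the canopies argument is dropped. */
public class ClusterDriverArgs {
    final String points;
    final String output;
    final String measureClassName;
    final float t1;
    final float t2;
    final String jarLocation;

    private ClusterDriverArgs(String points, String output, String measureClassName,
                              float t1, float t2, String jarLocation) {
        this.points = points;
        this.output = output;
        this.measureClassName = measureClassName;
        this.t1 = t1;
        this.t2 = t2;
        this.jarLocation = jarLocation;
    }

    /** Expects: points output measureClassName t1 t2 [jarLocation]. */
    static ClusterDriverArgs parse(String[] args) {
        if (args.length < 5) {
            throw new IllegalArgumentException(
                "usage: <points> <output> <measureClassName> <t1> <t2> [jarLocation]");
        }
        // Same default JAR name as in the quoted main(); the optional override moves from args[6] to args[5].
        String jarLocation = args.length > 5 ? args[5] : "apache-mahout-0.1-dev.jar";
        return new ClusterDriverArgs(args[0], args[1], args[2],
            Float.parseFloat(args[3]), Float.parseFloat(args[4]), jarLocation);
    }

    public static void main(String[] args) {
        ClusterDriverArgs a = parse(new String[] {
            "input/points", "output",
            "org.apache.mahout.clustering.canopy.ManhattanDistanceMeasure", "3.1", "2.1"});
        System.out.println(a.points + " " + a.output + " " + a.t1 + " " + a.t2 + " " + a.jarLocation);
    }
}
```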

        Grant Ingersoll added a comment -

        Committed revision 629348.

        Thanks, Jeff!


          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Jeff Eastman