Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

      Description

      KMeans currently calculates, on the map side, the distance between a set of seeds and all other vectors. It would be handy to have a generalization of this that, given a set of vectors that fits in memory (the seeds) and a set of other points, emits <seed id, other id, distance> according to the distance measure. This is similar to the RowSimilarityJob, but much simpler and not as general-purpose.
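
      The requested behavior can be sketched in plain Java, without Hadoop or Mahout types. This is only an illustration under stated assumptions: the class and method names are hypothetical, vectors are plain double arrays, and Euclidean distance stands in for whichever distance measure would be configured.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Mahout implementation.
public class SeedDistanceSketch {

  /** One emitted record: <seed id, other id, distance>. */
  public static final class Distance {
    public final String seedId;
    public final String otherId;
    public final double distance;

    public Distance(String seedId, String otherId, double distance) {
      this.seedId = seedId;
      this.otherId = otherId;
      this.distance = distance;
    }
  }

  /** Euclidean distance; an assumed stand-in for the configured distance measure. */
  public static double euclidean(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Mirrors one map() call: compare a single input point against every in-memory seed. */
  public static List<Distance> map(Map<String, double[]> seeds, String otherId, double[] other) {
    List<Distance> out = new ArrayList<>();
    for (Map.Entry<String, double[]> seed : seeds.entrySet()) {
      out.add(new Distance(seed.getKey(), otherId, euclidean(seed.getValue(), other)));
    }
    return out;
  }

  public static void main(String[] args) {
    // The seeds are small enough to hold in memory; the other points stream through map().
    Map<String, double[]> seeds = new LinkedHashMap<>();
    seeds.put("seed-0", new double[] {0.0, 0.0});
    seeds.put("seed-1", new double[] {3.0, 4.0});
    for (Distance d : map(seeds, "point-a", new double[] {3.0, 0.0})) {
      System.out.println(d.seedId + "\t" + d.otherId + "\t" + d.distance);
    }
  }
}
```

      Because every seed is compared with each incoming point inside the mapper, no reduce step is needed to produce the distances, which is what keeps this simpler than a general pairwise similarity job.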

      1. MAHOUT-763.patch (13 kB, Grant Ingersoll)
      2. MAHOUT-763.patch (13 kB, Grant Ingersoll)
      3. MAHOUT-763.patch (16 kB, Grant Ingersoll)
      4. MAHOUT-763.patch (17 kB, Grant Ingersoll)
      5. SeedVectorUtil.patch (7 kB, Sean Owen)

        Activity

        Grant Ingersoll added a comment -

        First draft of a patch. Input seeds can be vector, Cluster or Canopy. Output is <StringTuple, DoubleWritable>, where the first StringTuple entry is the name of the seed vector (the job induces a NamedVector over the seeds depending on the input) and the second entry is either the input key to the mapper or, if the input value is a NamedVector, the name of that vector. This could likely be parameterized a bit more so people could select which one they want.
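
        The choice of the second tuple entry described here can be sketched as follows. This is a minimal illustration, not the patch's code: `InputVector` is a hypothetical stand-in for a vector that may or may not carry a name, and a plain list of strings stands in for StringTuple.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how the output key could be assembled.
public class OutputKeySketch {

  /** Stand-in for an input value; name is null when the vector is unnamed. */
  public static final class InputVector {
    final String name;
    final double[] values;

    public InputVector(String name, double[] values) {
      this.name = name;
      this.values = values;
    }
  }

  /** Builds the two-entry key: [seed name, input vector name or mapper input key]. */
  public static List<String> outputKey(String seedName, String mapperInputKey, InputVector value) {
    // Prefer the vector's own name; fall back to the mapper's input key.
    String otherId = value.name != null ? value.name : mapperInputKey;
    return Arrays.asList(seedName, otherId);
  }

  public static void main(String[] args) {
    double[] v = {1.0, 2.0};
    System.out.println(outputKey("seed-0", "42", new InputVector("doc-7", v)));
    System.out.println(outputKey("seed-0", "42", new InputVector(null, v)));
  }
}
```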

        Grant Ingersoll added a comment -

        fix import

        Grant Ingersoll added a comment -

        Fixed some issues with the job configuration

        Grant Ingersoll added a comment -

        Handles multiple seed files

        Grant Ingersoll added a comment -

        Committed revision 1147257

        Grant Ingersoll added a comment -

        Going to reopen to provide an alternate output form

        Hudson added a comment -

        Integrated in Mahout-Quality #943 (See https://builds.apache.org/job/Mahout-Quality/943/)
        MAHOUT-763: add alternative output mapping
        MAHOUT-763: add map-side distance calculation
        MAHOUT-763: hook into bin/mahout

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147318
        Files :

        • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/similarity/TestVectorDistanceSimilarityJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/SeedVectorUtil.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceSimilarityJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceInvertedMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceMapper.java

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147257
        Files :

        • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/similarity/TestVectorDistanceSimilarityJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceSimilarityJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceMapper.java

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147102
        Files :

        • /mahout/trunk/src/conf/driver.classes.props
        Sean Owen added a comment -

        When the name `seedPath + "." + item++` is created – does including the seedPath matter much or is it just a helpful label? I've got a number of changes that would greatly simplify this file and the only behavior change would be losing access to `seedPath`.

        Grant Ingersoll added a comment -

        Helpful label. Put up a patch and I'll take a look.

        Grant Ingersoll added a comment -

        The code is more or less a copy of what's in KMeans for loading the Cluster objects.

        Sean Owen added a comment -

        This is what I had in mind – it looks like more change than it is due to whitespace. The key is just letting it take care of iterating over files in subdirs. Just a little tidier, not a big deal either way.

        I can make a similar change in kmeans.
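
        The idea of letting a single helper take care of iterating over files in subdirectories can be sketched with the standard `java.nio.file` API. This is only a local-filesystem analogy under stated assumptions: the class and method names are hypothetical, and the actual patch works against Hadoop paths and sequence files rather than `java.nio`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogy, not the Hadoop-based code in the patch.
public class SeedFileWalkSketch {

  /** Returns every regular file under seedDir, including files in nested subdirectories. */
  public static List<Path> listSeedFiles(Path seedDir) throws IOException {
    // Files.walk recurses for us, so callers need no hand-written directory loop.
    try (Stream<Path> paths = Files.walk(seedDir)) {
      return paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("seeds");
    Files.createFile(root.resolve("part-00000"));
    Path sub = Files.createDirectory(root.resolve("sub"));
    Files.createFile(sub.resolve("part-00001"));
    for (Path p : listSeedFiles(root)) {
      System.out.println(root.relativize(p));
    }
  }
}
```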

        Grant Ingersoll added a comment -

        +1

        Hudson added a comment -

        Integrated in Mahout-Quality #948 (See https://builds.apache.org/job/Mahout-Quality/948/)
        Style changes on MAHOUT-763 and new Pagerank code; mostly copyright header and simpler iteration over dirs of sequence files

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147646
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/SeedVectorUtil.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceSimilarityJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceInvertedMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/WeightedPropertyVectorWritable.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/WeightedOccurrence.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/triangles/EnumerateTrianglesJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/AbstractThetaTrainer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/DescriptorUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansUtil.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/linkanalysis/PageRankJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/RowSimilarityJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/WeightedOccurrenceArray.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainAdaptiveLogistic.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansUtil.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/triangles/VertexOrMarker.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/common/GraphUtils.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/model/Edge.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/VectorDistanceMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/WeightedVectorWritable.java

          People

          • Assignee: Grant Ingersoll
          • Reporter: Grant Ingersoll
          • Votes: 0
          • Watchers: 0
