MAHOUT-479: Streamline classification/clustering data structures

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1, 0.2, 0.3, 0.4
    • Fix Version/s: 0.6
    • Component/s: Classification, Clustering
    • Labels: None

      Description

      Opening this JIRA issue to collect ideas on how to streamline our classification and clustering algorithms to make integration easier for users, as per the mailing list thread http://markmail.org/message/pnzvrqpv5226twfs

      Jake and Robin and I were talking the other evening and a common lament was that our classification (and clustering) stuff was all over the map in terms of data structures. Putting that to rest and getting those components even vaguely as plug-and-play as our much more advanced recommendation components would be very, very helpful.

      This issue probably also relates to MAHOUT-287 (the intention there is to make Naive Bayes run on vectors as input).

      Ted, Jake, Robin: It would be great if one of you could add a comment on some of the issues you discussed "the other evening" and (if applicable) any minor or major changes you think could help resolve this issue.

          Activity

          Isabel Drost-Fromm added a comment - edited

          Some thoughts that come to my mind:

          • Algorithm implementations should be able to rely on getting their input as vectors+.
          • To make plug'n'play of algorithms easy we need to provide sensible default parameters for each implementation (for spectral clustering that would include adding a default strategy for computing an affinity matrix from a set of item vectors).
          • Parameters must be easy to override.
          • Integrate with the vector generation classes in mahout.util - should we move anything feature related that is still in core there?
          • Need a set of common interfaces for classification algorithms (methods like train and classify come to mind) so that implementations can be exchanged easily; a rough sketch follows at the end of this comment.

          I have probably forgotten dozens of other open questions - any input welcome.

          + Could potentially help with MAHOUT-287; however, I need some help understanding the existing code.
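
          To make the interface bullet above concrete, here is a minimal sketch of such a shared contract. The name and signatures are hypothetical illustrations only, not an existing Mahout API:

           import org.apache.mahout.math.Vector;

           // Hypothetical sketch: one face for all classification algorithms so that
           // implementations (Naive Bayes, SGD, ...) can be swapped behind a single contract.
           public interface VectorClassifier {
             /** Incrementally observe one labeled training example. */
             void train(int actualCategory, Vector instance);

             /** Return a vector of scores/probabilities, one per category. */
             Vector classify(Vector instance);

             /** Number of categories this classifier distinguishes. */
             int numCategories();
           }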

          Drew Farris added a comment -

          Thanks for getting the ball rolling, Isabel.

          More discussion from the mailing list can be found here http://markmail.org/thread/tarn6f4ump5zn67n

          One thought I remember reading somewhere was that, instead of using a common datatype as input for classifiers and clusterers, we could provide something like an InputSource interface, which defines how things are read and can be implemented for any number of physical input types (e.g. text files, serialized vectors in sequence files, various kinds of distributed data stores). We would provide a number of input sources with the appropriate methods and allow users of Mahout to implement their own; a rough sketch follows at the end of this comment.

          Some other discussion I'm particularly interested in is how we might unify models across the classification and clustering code.
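
          As an illustration of that InputSource idea, here is a purely hypothetical sketch (no such interface exists in Mahout at this point; the names are made up):

           import java.io.IOException;
           import org.apache.mahout.math.Vector;

           // Hypothetical sketch: hide the physical storage (text files, sequence files of
           // vectors, distributed stores) behind one reader contract that yields vectors.
           public interface InputSource extends Iterable<Vector> {
             /** Open the underlying storage; each implementation decides what that means. */
             void open() throws IOException;

             /** Release any resources held by the source. */
             void close() throws IOException;
           }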

          Ted Dunning added a comment -

          Unification of the resulting models is probably much easier than the unification of the model building process itself.

          Some of the problems I have seen include:

          a) all of our clustering and classification models should be able to accept vectors and produce either a "most-likely" category or a vector of scores for all possible categories. Unfortunately, there is no uniform way to load a model from a file and no uniform object structure for these models and no consistent way to call them.

          b) most of our learning algorithms would be happy with vectors, but there is a pretty fundamental difference between good ways to call hadoop-based and sequential training algorithms. The sequential stuff is traditional java so the interface is very easy. The parallel stuff is considerably harder to make into a really good interface. We may learn some tricks with Plume or we may be able to use the Distributed Row Matrix, but it isn't an obvious answer.

          c) in some cases, the vectors are noticeably larger than the original data. This occurs when the original data is very sparse and we are looking at lots of interaction variables. Again, for sequential algorithms, this is pretty easy to deal with, but for parallel ones, it really might be better to store the original data and pass in a function that handles the vectorization on the fly.
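
          A hedged sketch of the idea in (c), passing a vectorization function along with the raw records instead of pre-materialized vectors (the names here are illustrative only, not part of any existing API):

           import org.apache.mahout.math.Vector;

           // Illustration only: a hook that expands the original (compact, sparse) record
           // into a possibly much larger feature vector on the fly, so a parallel job can
           // store the raw data and vectorize only where it is needed.
           public interface Vectorizer<R> {
             Vector vectorize(R rawRecord);
           }

           // A training job could then be parameterized by the raw record type plus a
           // Vectorizer<R>, e.g. train(Iterable<R> rawData, Vectorizer<R> fn),
           // rather than by Iterable<Vector>.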

          Ted Dunning added a comment -

          Regarding vectorization strategies in general, I know of three that we currently use:

          a) full-size vector model. Each continuous variable and each unique term for each text-like variable and each unique interaction term is assigned a unique location in the feature vector. This requires that we know the vocabulary and the number of interaction combinations that exist. This is the strategy used by the Lucene to vector converter.

          b) pass around the text. This is what Naive Bayes currently does. The result is kind of like a duplicated vector implementation with strings as the indices. The current implementation has difficulty with different fields containing text and with continuous variables. This avoids a dictionary-building pass, but leads to other difficulties unless we figure out a way to inject a text vectorizer.

          c) feature hashing. This is what SGD supports. Here, we pick the size of the vectors a priori. Words are assigned a location based on a hash of the variable name and the word value. Continuous variables are assigned locations based on the variable name. It can be moderately difficult to reverse engineer a vector back to features since there can be ambiguity with very large feature spaces, but it isn't necessary to build a dictionary in order to make vectors.
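
          To illustrate the hashing approach in (c) without reference to the actual SGD encoder code, here is a small self-contained toy example; everything about it (class name, hash choice, collision handling) is an assumption made for illustration:

           import java.util.Arrays;

           // Toy feature hashing: fixed-size array picked a priori, words hashed by
           // (field, word), continuous variables hashed by field name alone.
           // Collisions are possible by design; no dictionary pass is required.
           public class HashedEncoderSketch {
             private final double[] features;

             public HashedEncoderSketch(int numFeatures) {
               this.features = new double[numFeatures];
             }

             public void addWord(String field, String word) {
               int slot = Math.floorMod((field + "=" + word).hashCode(), features.length);
               features[slot] += 1.0;
             }

             public void addContinuous(String field, double value) {
               int slot = Math.floorMod(field.hashCode(), features.length);
               features[slot] += value;
             }

             public double[] toArray() {
               return Arrays.copyOf(features, features.length);
             }
           }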

          Ted Dunning added a comment -

          I am about to put up a patch related to MAHOUT-228 that proposes some interfaces for on-line learning and vector classification.

          These interfaces are probably too special purpose to be considered an answer to MAHOUT-479, but they indicate where my own work has been leading me lately.

          Ted Dunning added a comment -

          I just moved the encoding objects associated with MAHOUT-228 to org.apache.mahout.vectors to provide a nucleus for feature encoding.

          There are also a fair number of things in oam.text and oam.utils that are related. Since those are in the utils module, however, I couldn't leverage them. We may want to consider moving some of them to core to allow wider use.

          Jeff Eastman added a comment -

          I'm making some progress on integrating the clustering data structures. Here's how the top levels are shaping up:

          // this defines the minimal Dirichlet model
          public interface Model<O> extends Writable {
            double pdf(O x);
            void observe(O x);
            int count();
            void computeParameters();
            public Model<VectorWritable> sampleFromPosterior();
          }
          
          // this puts a face on a cluster
          public interface Cluster {
            int getId();
            Vector getCenter();
            Vector getRadius();
            int getNumPoints();
            String asFormatString(String[] bindings);
            String asJsonString();
          }
          
          // here's the resulting Cluster class hierarchy
          public abstract class AbstractCluster implements Cluster, Model<VectorWritable> {}
            public abstract class DistanceMeasureCluster extends AbstractCluster {}
              public class Canopy extends DistanceMeasureCluster {}
              public class Cluster extends DistanceMeasureCluster {}
                public class MeanShiftCanopy extends Cluster {}
                public class SoftCluster extends Cluster {}
            public class GaussianCluster extends AbstractCluster {}
          
          // Note: all the current Dirichlet models can be subsumed by GaussianCluster or simple subclasses
          

          There's a fair amount of cleanup to tests etc. still to do, but I thought I'd post this for visibility.

          Ted Dunning added a comment -

          Jeff,

          How do you think that these relate to AbstractVectorClassifier?

          Does it fit reasonably as a super-class of the models?

          Jeff Eastman added a comment -

          I don't see AbstractVectorClassifier as a super-class of these models, since it needs to operate on a set of models rather than an individual model. What I see is something more like below. I realize classify() does not return the right sized vector, but how else to normalize it properly given that an arbitrary set of model pdfs won't sum to 1? Also, what about making AbstractVectorClassifier work over VectorWritables instead of Vectors? All the clustering code uses VWs.

          public class VectorModelClassifier extends AbstractVectorClassifier {
          
            List<Model<VectorWritable>> models;
          
            public VectorModelClassifier(List<Model<VectorWritable>> models) {
              super();
              this.models = models;
            }
          
            @Override
            public Vector classify(Vector instance) {
              Vector pdfs = new DenseVector(models.size());
              int i = 0;
              for (Model<VectorWritable> model : models) {
                pdfs.set(i++, model.pdf(new VectorWritable(instance)));
              }
              return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
            }
          
            @Override
            public double classifyScalar(Vector instance) {
              if (models.size() == 2) {
                double pdf0 = models.get(0).pdf(new VectorWritable(instance));
                double pdf1 = models.get(1).pdf(new VectorWritable(instance));
                return pdf0 / (pdf0 + pdf1);
              }
              throw new IllegalStateException();
            }
          
            @Override
            public int numCategories() {
              return models.size();
            }
          }
          
          Hudson added a comment -

          Integrated in Mahout-Quality #202 (See https://hudson.apache.org/hudson/job/Mahout-Quality/202/)
          MAHOUT-479: Refactored Model and Cluster hierarchies to include DistanceMeasure in cluster state. All unit tests run

          Hudson added a comment -

          Integrated in Mahout-Quality #208 (See https://hudson.apache.org/hudson/job/Mahout-Quality/208/)
          MAHOUT-479: This commit refactors Cluster to inherit from Model<VectorWritable> instead of AbstractCluster
          which now inherits just Cluster. The existing Dirichlet models also now inherit from Cluster, simplifying the use
          of generics and cleaning up a lot of the code.

          Since Dirichlet now can iterate over arbitrary Clusters as its models, this opens up the entire set of DistanceMeasure
          based clusters for Dirichlet processing. This should allow the output of e.g. Canopy to become the prior model
          distribution in a subsequent Dirichlet step. Is this a feature?

          The AbstractCluster hierarchy has been adjusted allowing GaussianClusterDistribution and
          DistanceMeasureClusterDistribution to instantiate its subclasses. Both have tests and seem to work as expected.

          All unit tests run. More to follow

          Hudson added a comment -

          Integrated in Mahout-Quality #213 (See https://hudson.apache.org/hudson/job/Mahout-Quality/213/)
          MAHOUT-479: Fixed a bug in TestVectorModelClassifier and messed around with pdf() in GaussianCluster and ASNModel. Synthetic control seems to work better on Dirichlet now but I'm troubled by the impact of the two pdf() implementations on the outcome.

          Hudson added a comment -

          Integrated in Mahout-Quality #241 (See https://hudson.apache.org/hudson/job/Mahout-Quality/241/)

          Ted Dunning added a comment -

          I did a commit recently that introduced ModelDissector. This is useful for reverse engineering feature hashed models.

          The idea is that the hashed encoders have the option of having a trace dictionary. This tells us where each feature is hashed to, or each feature/value combination in the case of word-like values. Using this dictionary, we can put values into a synthetic feature vector in just the locations specified by a single feature or interaction. Then we can push this through a linear model to see the contribution of that input. For any generalized linear model like logistic regression, there is a linear part of the model that allows this.

          What the ModelDissector does is to accept a trace dictionary and a model in an update method. Then, in a flush method, the biggest weights are returned. This update/flush style is used so that the trace dictionary doesn't have to grow to enormous levels, but instead can be cleared between updates.
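
          A much-simplified sketch of that update/flush pattern follows. It is not the actual ModelDissector API; the class name, fields and signatures below are made up for illustration, assuming the trace dictionary maps feature names to hashed slots and the model exposes a flat weight array:

           import java.util.ArrayList;
           import java.util.HashMap;
           import java.util.List;
           import java.util.Map;

           // Illustration only: collect per-feature weights from the linear part of a model
           // using a trace dictionary (feature name -> hashed slot), then report the biggest.
           public class DissectorSketch {
             private final Map<String, Double> weights = new HashMap<>();

             /** Look up each traced feature's slot in the model's weight array. */
             public void update(Map<String, Integer> traceDictionary, double[] modelWeights) {
               for (Map.Entry<String, Integer> e : traceDictionary.entrySet()) {
                 double w = modelWeights[e.getValue()];
                 // keep the largest-magnitude contribution seen so far for this feature name
                 weights.merge(e.getKey(), w, (a, b) -> Math.abs(a) >= Math.abs(b) ? a : b);
               }
             }

             /** Features sorted by absolute weight; the trace dictionary can be cleared between updates. */
             public List<Map.Entry<String, Double>> flush() {
               List<Map.Entry<String, Double>> sorted = new ArrayList<>(weights.entrySet());
               sorted.sort((a, b) -> Double.compare(Math.abs(b.getValue()), Math.abs(a.getValue())));
               return sorted;
             }
           }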

          Hudson added a comment -

          Integrated in Mahout-Quality #300 (See https://hudson.apache.org/hudson/job/Mahout-Quality/300/)
          MAHOUT-479 - Added javadoc explanations.

          Jeff Eastman added a comment -

          Moving this from neverland to 0.5. A lot has been accomplished in 0.4 in the clustering jobs, but more work is needed to converge with classification, SVD and perhaps recommenders.

          Sean Owen added a comment -

          I like the idea of getting this in for 0.5, building on the work already done. Is there a reasonable "stopping point" in sight that would be a good place to leave it for 0.5? I'm supposing we'll start heading to release within a month.

          Sean Owen added a comment -

          Are there any near-term changes that are worth getting in for 0.5? It is looking like this won't be fully resolved for 0.5.

          Ted Dunning added a comment -

          I think we have bits of this already in, but the grand unified theory is still lacking.

          I wouldn't expect it for 0.5.

          Jeff Eastman added a comment - edited

          Ted: Yeah... this is what I had in mind when I said grand unified theory.

          On Wed, Apr 13, 2011 at 9:24 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

          > This could potentially collapse kmeans, fuzzyk and Dirichlet into a single implementation too:
          >
          > - Begin with a prior ClusterClassifier containing the appropriate sort of Cluster, in clusters-n
          > - For each input Vector, compute the pdf vector using CC.classify()
          > – For kmeans, train the most likely model from the pdf vector
          > – For Dirichlet, train the model selected by the multinomial of the pdf vector * mixture vector
          > – For fuzzyk, train each model by its normalized pdf (would need a new classify method for this)
          > - Close the CC, computing all posterior model parameters
          > - Serialize the CC into clusters-n+1
          >
          > Now that would really be cool
          >
          >
          > On 4/13/11 9:00 PM, Jeff Eastman wrote:
          >
          >> Here's how I got there:
          >>
          >> - ClusterClassifier holds a "List<Cluster> models;" field as its only state just like VectorModelClassifier does
          >> - Started with ModelSerializerTest since you suggested being compatible with ModelSerializer
          >> - This tests OnlineLogisticRegression, CrossFoldLearner and AdaptiveLogisticRegression
          >> - The first two are also subclasses of AbstractVectorClassifier just like ClusterClassifier
          >> - The tests pass OLR and CFL learners to train(OnlineLearner) so it made sense for a CC to be an OL too
          >> - The new CC.train(...) methods map to "models.get(actual).observe()" in Cluster.observe(V)
          >> - CC.close() maps to cluster.computeParameters() for each model which computes the posterior cluster parameters
          >> - Now the CC is ready for another iteration or to classify, etc.
          >>
          >> So, the cluster iteration process starts with a prior List<Cluster>
          >> which is used to construct the ClusterClassifier. Then in each
          >> iteration each point is passed to CC.classify() and the maximum
          >> probability element index in the returned Vector is used to train()
          >> the CC. Since all the DistanceMeasureClusters contain their
          >> appropriate DistanceMeasure, the one with the maximum pdf() is the
          >> closest. Just what kmeans already does but done less efficiently (it
          >> uses just the minimum distance, but pdf() = e^-distance so the closest cluster has the largest pdf()).
          >>
          >> Finally, instead of passing in a List<Cluster> in the KMeansClusterer
          >> I can just carry around a CC which wraps it. Instead of serializing a
          >> List<Cluster> at the end of each iteration I can just serialize the
          >> CC. At the beginning of the next iteration, I just deserialize it and go.

          Jeff Eastman added a comment -

          Here's an interesting piece of code. It nets out the clustering process in terms of the ClusterClassifier, which is an AbstractVectorClassifier which implements OnlineLearner and Writable. The policy for kmeans just returns the index of the max pdf element and the policy for dirichlet returns the multinomial of the mixture times the pdf. The really exciting thing to me is how, for clustering, we classify and then train whereas for classification it is the opposite order. Very symmetric and, uh, now rather obvious.

            private ClusteringPolicy policy;
          
            /**
             * Iterate over data using a prior-trained classifier, for a number of iterations
             * @param data a List<Vector> of input vectors
             * @param prior the prior-trained ClusterClassifier
             * @param numIterations the int number of iterations to perform
             * @return the posterior ClusterClassifier
             */
            public ClusterClassifier iterate(List<Vector> data, ClusterClassifier prior, int numIterations) {
              for (int iteration = 1; iteration <= numIterations; iteration++) {
                for (Vector vector : data) {
                  // classification yields probabilities
                  Vector pdfs = prior.classify(vector);
                  // policy selects a model given those probabilities
                  int selected = policy.select(pdfs);
                  // training causes all models to observe data
                  prior.train(selected, vector);
                }
                // compute the posterior models
                prior.close();
                // update the policy
                policy.update(prior);
              }
              return prior;
            }
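
          For reference, here is a hedged sketch of what the policies described above could look like. The ClusteringPolicy name comes from the comment itself; the method names and the KMeansPolicySketch class are assumptions made for illustration and may not match the actual patch:

            import org.apache.mahout.math.Vector;

            // Hypothetical sketch: the policy decides which model to train from the
            // probabilities produced by ClusterClassifier.classify().
            public interface ClusteringPolicy {
              /** Select the index of the model to train for this observation. */
              int select(Vector probabilities);

              /** Update any policy state (e.g. Dirichlet mixture weights) after an iteration. */
              void update(ClusterClassifier posterior);
            }

            // k-means flavour: always train the single most likely model; it keeps no state.
            class KMeansPolicySketch implements ClusteringPolicy {
              @Override
              public int select(Vector probabilities) {
                return probabilities.maxValueIndex();
              }

              @Override
              public void update(ClusterClassifier posterior) {
                // stateless for k-means
              }
            }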
          
          Hudson added a comment -

          Integrated in Mahout-Quality #769 (See https://builds.apache.org/hudson/job/Mahout-Quality/769/)

          Hudson added a comment -

          Integrated in Mahout-Quality #805 (See https://builds.apache.org/hudson/job/Mahout-Quality/805/)
          MAHOUT-479: added a new iterate method to ClusterIterator. Method accepts 3 hadoop Paths for input, prior and output information plus number of desired iterations. All algorithm data is pulled-from/pushed-to SequenceFiles. Added a unit test and improved the example DisplayKMeans, DisplayFuzzyKMeans and DisplayDirichlet to use the new file-based implementation. Check out Dirichlet.

          Vasil Vasilev added a comment -

          Hi Jeff,

          I am running Dirichlet clustering over the Reuters data set. Due to issues with the resulting clusters, I found that there is a problem with the function calculating the pdf in org.apache.mahout.clustering.dirichlet.models.GaussianCluster and with the change you made recently:
          1. Some of the clusters have a very small radius (close to 0). This leads to UncommonDistributions.dNorm returning 0.0 when the point is at a larger distance from the mean.
          2. dNorm returns a probability density, not a probability, which means that when the radius is close to 0 and the number of dimensions of the feature vectors is very large (~50000), the pdf quickly goes to infinity.
          3. When 1 and 2 both happen, the result of the pdf is NaN.
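
          One common way to sidestep the underflow/overflow described above (a sketch only, assuming the pdf is computed as a product of per-dimension Gaussian densities) is to accumulate log densities and compare those directly instead of multiplying raw densities:

           // Sketch: sum log densities so a ~50,000-dimensional product neither
           // underflows to 0 nor overflows to infinity.
           public final class LogPdfSketch {
             private LogPdfSketch() {}

             /** Log of the univariate normal density. */
             static double logDNorm(double x, double mean, double sd) {
               double z = (x - mean) / sd;
               return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
             }

             /** Log pdf of a point under an axis-aligned Gaussian with per-dimension means and sds. */
             static double logPdf(double[] x, double[] mean, double[] sd) {
               double logP = 0.0;
               for (int i = 0; i < x.length; i++) {
                 logP += logDNorm(x[i], mean[i], sd[i]);
               }
               return logP; // compare clusters by log pdf; exp() of this may still underflow
             }
           }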

          Sean Owen added a comment -

          Am I right that this is done? Looks so from the JIRA commits. Or at least, is it better to continue further work in a new JIRA?


            People

            • Assignee: Isabel Drost-Fromm
            • Reporter: Isabel Drost-Fromm
            • Votes: 0
            • Watchers: 6
