Details

Type: New Feature

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: None

Fix Version/s: 0.3

Component/s: Clustering

Labels:None
Description
Copied over from original issue:
> Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> nonparametric clustering algorithm.

 MAHOUT30.patch
 33 kB
 Jeff Eastman

 MAHOUT30b.patch
 45 kB
 Jeff Eastman

 MAHOUT30c.patch
 72 kB
 Jeff Eastman

 MAHOUT30d.patch
 66 kB
 Jeff Eastman

 MAHOUT30e.patch
 70 kB
 Jeff Eastman

 dirichlet2.tar
 690 kB
 Jeff Eastman

 screenshot1.jpg
 149 kB
 Jeff Eastman

 MAHOUT30f.patch
 143 kB
 Jeff Eastman
Issue Links
 is part of

MAHOUT4 Simple prototype for Expectation Maximization (EM)
 Closed
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Final patch file is ready to commit. Need to add entry to pom for gson jar which is currently in core/lib.
generateSamples(500, 0, 0, 0.5);
generateSamples(500, 2, 0, 0.2);
generateSamples(500, 0, 2, 0.3);
generateSamples(500, 2, 2, 1);
DirichletDriver.runJob("input", "output",
"org.apache.mahout.clustering.dirichlet.SampledNormalDistribution", 20, 15, 1.0, 2);
 fixed bug in rBeta where arguments to rGamma were backwards
 fixed model std calculations to handle single sample case (s0=1 => std==0.0 which made resulting pdf==NaN and horked all subsequent cluster assignments)
 added cluster history display to examples with colors to illustrate cluster convergence
This is behaving pretty nicely now
This file contains a tar of a standalone Eclipse project named Dirichlet Cluster. The directory structure is selfcontained and only depends upon the Mahout project. It uses the Gson beta1.3 jar file for serialization/deserialization and that is included.
This version contains an initial Hadoop implementation that has been tested through 10 iterations. In each iteration, clusters are read from the previous iteration (in the statei directories) and points are assigned to clusters in the Mapper. The reducer then observes all points for each cluster and computes new clusters which are output for the subsequent iteration. The unit test TestMapReduce.testDriverMRIterations() creates 400 data points and runs the MR Driver. Then it gathers all of the state files and summarizes them on the console.
I noticed when I replaced the beta distribution code earlier that the clustering now tends to put everything into the same cluster. I'm suspicious about the beta values that are being computed and need to investigate this further.
I think this design will allow an arbitrary number of Mappers and Reducers up to the number of clusters. There is a stub Combiner class that is not currently used. I will continue to develop unit tests but I wanted to get this into view because it is a real first light MR implementation.
Jeff
Here's the patch file
This patch is completely selfcontained as it depends only upon the uncommons.math package for its distributions. I think this is ready to commit to trunk, after the release of course. The next steps will be to implement a M/R algorithm using this basic approach.
This patch moves the displayrelated classes into the examples subtree and removes the dependency upon commons.math which needs some extra patches to the release and is TBD. It still depends upon blog0.3 which I have only been able to find in source format. I will try to locate or build a suitable jar for that. I believe, since it is under BSD license we are ok but I'd like confirmation. The blog site is at http://people.csail.mit.edu/milch/blog/software.html.
That would be great. The wiki is really a better place.
Ted: I am moving the references over into the wiki page. Hope that is fine with you?
> Regarding the question of whether something should be called a Distribution or a Sampler, the mathematical terminology is that a distribution is
> something you can sample so the the Distribution terminology would be most compatible that way. The fact that only one method is currently
> defined is likely a temporary thing ... other methods could well be required for later efforts.
I understand. I do not have any strong objections. I think, a short class comment in DirichletDistribution would already help to avoid at least my confusion. (Although not trying to understand code at 2a.m. local time might help as well... )
> I was thinking of moving the display code into the examples directory.
Sounds like a great idea to me.
> I did that so Ted could use his favorite library but he has not been pursuing it. I'm happy with blog and, as commons does not have the
> needed sampling methods without Ted's patches, suggest we could go with blog. Removing the plugability would clean up the code some too.
Do you know what the current status of the patches is? I must admit I have a slight preference for commonsmath as well, in case they support what we need.
One good reference is this relatively dense article by McCullagh and Yang:
http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf
There is also a more approachable example in Chris Bishop's book on Machine Learning. See http://research.microsoft.com/enus/um/people/cmbishop/PRML/index.htm. I think that chapter 9 is where the example of clustering using a mixture model is found.
The Neal and Blei references from the McCullagh and Yang paper are also good. Zoubin Gharamani has some very nice tutorials out which describe why nonparametric Bayesian approaches to problems are very cool. One is at http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorialb.pdf but here are video versions about as well.
Regarding the question of whether something should be called a Distribution or a Sampler, the mathematical terminology is that a distribution is something you can sample so the the Distribution terminology would be most compatible that way. The fact that only one method is currently defined is likely a temporary thing ... other methods could well be required for later efforts.
On Fri, Dec 26, 2008 at 10:53 AM, Jeff Eastman (JIRA) <jira@apache.org> wrote:
... Some of those were terms Ted introduced from my original port of his R
example. I'm not hung up but perhaps we should include him in the
discussion?
Hi Isabel,
I'm so happy you had time to look through the code. Getting it to this
point was a great ordeal for me as the math is complicated and I have no
formal statistics background. Ted's help was critical in getting me to
the tipping point where I now understand the implementation well enough
to make progress on my own. I'm getting ready for a week vacation and
will not have email but would love to continue this dialog and am very
open to your suggestions below. See more comments therein.
Jeff
I was thinking of moving the display code into the examples directory.
I did that so Ted could use his favorite library but he has not been
pursuing it. I'm happy with blog and, as commons does not have the
needed sampling methods without Ted's patches, suggest we could go with
blog. Removing the plugability would clean up the code some too.
? Does this relate to maven?
Boy, I would too. Especially if it was clear enough that I could
understand it
Some of those were terms Ted introduced from my original port of his R
example. I'm not hung up but perhaps we should include him in the
discussion?
First of all: Great work, Jeff. I finally found some time to have a closer look at the code. To me it already looks pretty clear and easy to understand. Some minor comments:
I did not have a close look at the code for displaying the clustering process so far. If it is to be retained in the final version it might be a good idea to move that into its own package?
I was wondering why you wrapped two math packages (blog and commonsmath). Maybe it helps if you shortly name advantages and shortcomings of either? I was missing a pointer to Ted's patch to commons math. Maybe I just overlooked it?
In the patch I am missing changes to the dependencies of the pom.xml. I guess you would check in the libraries the patch is depending on into our libs directory? On first sight the license of BLOG as well its dependencies seems fine, any one else verified this?
Me personally, I would love to see the mathematical fomulae behind the implementation in the docs as well, maybe a pointer to a book chapter/ publication or other source that explains the algorithm in more detail.
Looking through the code, I found it a little irritating to have classes ModelDistribution, NormalModelDistribution, and DirichletDistribution around. As the only method in the interface ModelDistribution is sampleFromPrior, it might be clearer if it were named ModelSampler? But maybe it is just the time of day I looked at it...
This patch fixes a randomization problem caused by using two Random instances and adds several display tests that were used to generate the examples in the wiki.
The above patch makes several improvements to the above:
 refactored state updating and cluster sampling into DirichletState
 refactored creating list of models into ModelDistribution
 refactored state parameters from DirichletCluster to DirichletState
 refactored count into the model
 changed list<Model> to Model[]
 added significance filtering to print out
 increased number of iterations to 30 to demonstrate better convergence
The algorithm now produces the following output when run over 10,000 points:
 Using fixed random seed for repeatability.
 testDirichletCluster10000
 Generating 4000 samples m=[1.0, 1.0] sd=3.0
 Generating 3000 samples m=[1.0, 0.0] sd=0.1
 Generating 3000 samples m=[0.0, 1.0] sd=0.1
 sample[0]= normal(n=4037 m=[0.80, 0.73] sd=1.40), normal(n=3844 m=[0.51, 0.51] sd=0.68), normal(n=1092 m=[0.51, 0.47] sd=0.53), normal(n=794 m=[1.26, 1.60] sd=2.22),
 sample[1]= normal(n=4562 m=[0.72, 0.68] sd=1.25), normal(n=2992 m=[0.48, 0.52] sd=0.58), normal(n=1022 m=[0.67, 0.31] sd=0.53), normal(n=1227 m=[1.17, 1.41] sd=2.13),
 sample[2]= normal(n=4377 m=[0.66, 0.61] sd=1.08), normal(n=2592 m=[0.28, 0.71] sd=0.51), normal(n=1057 m=[1.04, 0.06] sd=0.25), normal(n=1831 m=[1.15, 1.26] sd=2.05),
 sample[3]= normal(n=4302 m=[0.74, 0.36] sd=0.80), normal(n=2075 m=[0.00, 1.01] sd=0.32), normal(n=793 m=[1.04, 0.05] sd=0.20), normal(n=2694 m=[1.04, 1.17] sd=1.93),
 sample[4]= normal(n=3602 m=[0.80, 0.21] sd=0.58), normal(n=1923 m=[0.05, 1.05] sd=0.26), normal(n=621 m=[1.03, 0.06] sd=0.19), normal(n=3677 m=[0.94, 1.09] sd=1.77),
This patch is a complete implementation of a nonM/R Dirichlet Process clustering algorithm. It has been refactored significantly from the above. I will describe these changes in a subsequent posting. I'd like to commit this to trunk in time for 0.1 if people agree.
I refactored again and was able eliminate materializing of the posterior data sets by adding observe() and computeParameters() operations to Model. The idea is that all models begin in their prior state and are asked to observe each sample that is assigned to them. Then, before pdf() is called on them in the next iteration a call to computeParameters() finalizes the parameters once and turns the model into a posterior model. I also compute counts on the fly to eliminate materializing z altogether. I hope I didn't throw the baby out with the bath water.
Finally, I introduced a DirichletState bean to hold the models, dirichlet distribution and the mixture, simplifying the arguments and, I think, fixing a bug in the earlier refactoring. The algorithm runs over 10,000 points and produces the following outputs (prior() indicates a model with no observations, n is the number of observations, m the mean and sd the std):
Generating 4000 samples m=[1.0, 1.0] sd=3.0
Generating 3000 samples m=[1.0, 0.0] sd=0.1
Generating 3000 samples m=[0.0, 1.0] sd=0.1
 sample[0]= [prior(), normal(n=6604 m=[0.67, 0.63] sd=1.11), normal(n=86 m=[0.77, 2.81] sd=2.15), prior(), normal(n=242 m=[2.89, 1.67] sd=2.14), normal(n=2532 m=[0.53, 0.55] sd=0.69), normal(n=339 m=[0.99, 1.70] sd=2.18), normal(n=77 m=[0.53, 0.47] sd=0.51), normal(n=119 m=[0.36, 0.47] sd=2.85), normal(n=1 m=[0.00, 0.00] sd=0.33)]
 sample[1]= [prior(), normal(n=6626 m=[0.62, 0.54] sd=0.91), normal(n=137 m=[0.51, 2.99] sd=1.56), normal(n=2 m=[0.57, 0.25] sd=0.70), normal(n=506 m=[2.55, 0.93] sd=1.73), normal(n=1573 m=[0.38, 0.60] sd=0.50), normal(n=848 m=[0.81, 1.59] sd=2.11), normal(n=67 m=[0.76, 0.31] sd=0.45), normal(n=240 m=[0.73, 0.31] sd=2.24), normal(n=1 m=[0.00, 0.00] sd=0.98)]
 sample[2]= [prior(), normal(n=5842 m=[0.67, 0.39] sd=0.73), normal(n=157 m=[0.73, 3.12] sd=1.14), prior(), normal(n=655 m=[2.32, 0.64] sd=1.60), normal(n=1439 m=[0.00, 1.00] sd=0.33), normal(n=1439 m=[0.78, 1.53] sd=1.89), normal(n=66 m=[0.96, 0.04] sd=0.24), normal(n=399 m=[0.63, 0.03] sd=1.99), normal(n=3 m=[0.07, 0.76] sd=0.41)]
/** * A model is a probability distribution over observed data points and allows * the probability of any data point to be computed. */ public interface Model<Observation> { /** * Observe the given observation, retaining information about it * * @param x an Observation from the posterior */ public abstract void observe(Observation x); /** * Compute a new set of posterior parameters based upon the Observations * that have been observed since my creation */ public abstract void computeParameters(); /** * Return the probability that the observation is described by this model * * @param x an Observation from the posterior * @return the probability that x is in z */ public abstract double pdf(Observation x); }
/** * Initialize the variables and run the iterations to assign the sample data * points to a computed number of clusters * * @return a List<List<Model<Observation>>> of the observed models */ public List<List<Model<Observation>>> dirichletCluster() { DirichletState<Observation> state = initializeState(); // create a posterior sample list to collect results List<List<Model<Observation>>> clusterSamples = new ArrayList<List<Model<Observation>>>(); // now iterate for (int iteration = 0; iteration < maxIterations; iteration++) iterate(state, iteration, clusterSamples); return clusterSamples; } /** * Initialize the state of the computation * * @return the DirichletState */ private DirichletState<Observation> initializeState() { // get initial prior models List<Model<Observation>> models = createPriorModels(); // create the initial distribution. DirichletDistribution distribution = new DirichletDistribution(maxClusters, alpha_0, dist); // mixture parameters are sampled from the Dirichlet distribution. Vector mixture = distribution.sample(); return new DirichletState<Observation>(models, distribution, mixture); } /** * Create a list of prior models * @return the Observation */ private List<Model<Observation>> createPriorModels() { List<Model<Observation>> models = new ArrayList<Model<Observation>>(); for (int k = 0; k < maxClusters; k++) { models.add(modelFactory.sampleFromPrior()); } return models; } /** * Perform one iteration of the clustering process, updating the state for the next iteration * @param state the DirichletState<Observation> of this iteration * @param iteration the int iteration number * @param clusterSamples a List<List<Model<Observation>>> that will be modified in each iteration */ private void iterate(DirichletState<Observation> state, int iteration, List<List<Model<Observation>>> clusterSamples) { // create new prior models List<Model<Observation>> newModels = createPriorModels(); // initialize vector of membership counts Vector counts = new DenseVector(maxClusters); counts.assign(alpha_0 / maxClusters); // iterate over the samples for (Observation x : sampleData) { // compute vector of probabilities x is described by each model Vector pi = computeProbabilities(state, x); // then pick one cluster by sampling a Multinomial distribution based upon them // see: http://en.wikipedia.org/wiki/Multinomial_distribution int model = dist.rmultinom(pi); // ask the selected model to observe the datum newModels.get(model).observe(x); // record counts for the model counts.set(model, counts.get(model) + 1); } // compute new model parameters based upon observations for (Model<Observation> m : newModels) m.computeParameters(); // update the state from the new models and counts state.distribution.setAlpha(counts); state.mixture = state.distribution.sample(); state.models = newModels; // periodically add models to cluster samples after getting started if ((iteration > burnin) && (iteration % thin == 0)) clusterSamples.add(state.models); } /** * Compute a normalized vector of probabilities that x is described * by each model using the mixture and the model pdfs * * @param state the DirichletState<Observation> of this iteration * @param x an Observation * @return the Vector of probabilities */ private Vector computeProbabilities(DirichletState<Observation> state, Observation x) { Vector pi = new DenseVector(maxClusters); double max = 0; for (int k = 0; k < maxClusters; k++) { double p = state.mixture.get(k) * state.models.get(k).pdf(x); pi.set(k, p); if (max < p) max = p; } // normalize the probabilities by largest observed value pi.assign(new TimesFunction(), 1.0 / max); return pi; }
Jeff,
These look like really nice refactorings. The process is nice and clear.
The only key trick that may confuse people is that each step is a sampling. Thus assignment to clusters does NOT assign to the best cluster, it picks a cluster at random, biased by the mixture parameters and model pdf's. Likewise, model computation does NOT compute the best model, it samples from the distribution given by the data. Same is true for the mixture parameters.
Your code does this. I just think that this is a hard point for people to understand in these techniques.
I did some refactoring to better localize the major processing steps in DirichletCluster. The first refactoring was to create an iterate method that handles each iteration after the initialization phase initializes the models, Dirichlet and mixture. Then I factored out the three subprocessing steps: assignPointsToModels, recomputeModels and computeNewMixtureParameters and moved some of the instance variables to locals. Looking at the result it is starting to feel more like the other clustering algorithms so I'm starting to think in terms of an M/R partitioning.
 In assignPointsToModels, we are iterating over all of the points, calculating their pdf with respect to the current clusters (models), applying the mixture probabilities and then using a multinomial sampling to assign them to a single cluster for this iteration. In a subsequent iteration we might assign them to different models, depending upon the probabilities.
 In recomputeModels, we use the model assignments from this iteration to compute the posterior samples for each of the models. Then we generate a new model that best describes this subsample. The ahha for me here is that we create entirely new models in each iteration.
 In iterate, we periodically output the models so we can gain whatever wisdom we wish from their parameters. I didn't factor that out but may later.
 in computeNewMixtureParameters we use the same model assignments to compute a new alpha vector for the Dirichlet distribution and a new mixture vector too. Then we iterate over the points again, continually refining these values. The ahha for me here is we do not actually remember the model assignments. I think, after the models have all been computed, we could run another pass over the points computing the pdf for each cluster as a probability vector. We do this in Kmeans too.
The first step seems very similar to the KmeansMapper: we read the Dirichlet, mixture and the current models into each mapper, then compute the multinomial and output the point keyed by its model index to the reducer. In a single reducer, all of the points are seen so we can calculate the posterior statistics if we switch the model to use a running sums (s0, s1, s2) algorithm for the std. Following this approach, it might be we could use a combiner to maintain the running sums (assuming Hadoop only runs it once for now) thus reducing the heavy load on the poor reducer. With a single reducer, I'm pretty sure we can compute the new mixture and Dirichlet parameters. With multiple reducers I think those calculations would need help combing the reducer values in the driver.
Ted, I know you've thought about this a lot, do these steps sound reasonable?
/** * Perform one iteration of the clustering process, updating the Dirichlet distribution * and returning the new mixtures * * @param models a List<Model<Vector>> of models * @param dir a DirichletDistribution that will be modified in each iteration * @param mixture a Vector mixture * @param clusterSamples a List<List<Model<Vector>>> that will be modified in each iteration * @param iteration the int iteration number * @return a new mixture Vector */ private Vector iterate(List<Model<Vector>> models, DirichletDistribution dir, Vector mixture, List<List<Model<Vector>>> clusterSamples, int iteration) { // z holds the assignment of cluster numbers to points for this iteration Vector z = assignPointsToModels(models, mixture); models = recomputeModels(z); // periodically add models to cluster samples after getting started if ((iteration > burnin) && (iteration % thin == 0)) clusterSamples.add(models); // compute and return the new mixture parameters (adjusting // the Dirichlet dir's alpha vector in the process return computeNewMixtureParameters(dir, z); } /** * Assign all points to a model based upon the current mixture parameters and * the probability that each point is described by each model's pdf. * * @param models the current List<Model<Vector>> of models * @param mixture the Vector of mixture probabilities * @return z the Vector of assignment of points to models */ private Vector assignPointsToModels(List<Model<Vector>> models, Vector mixture) { // z holds the assignment of cluster numbers to points for this iteration Vector z = new DenseVector(sampleData.size()); Vector pi = new DenseVector(maxClusters); for (int ix = 0; ix < sampleData.size(); i++) { Vector x = sampleData.get(ix); // compute probabilities for each cluster, saving the largest double max = 0; for (int k = 0; k < maxClusters; k++) { double p = mixture.get(k) * models.get(k).pdf(x); pi.set(k, p); if (max < p) max = p; } // normalize the probabilities by largest observed value pi.assign(new TimesFunction(), 1.0 / max); // then pick one cluster by sampling a multinomial based upon the probabilities int model = dist.rmultinom(pi); z.set(i, model); } return z; } /** * Recompute new models based upon the assignment of points to models and return them * * @param z the Vector of assignment of points to models * @return the List<Model<Vector>> of models */ private List<Model<Vector>> recomputeModels(Vector z) { List<Model<Vector>> newModels = new ArrayList<Model<Vector>>(); for (int k = 0; k < maxClusters; k++) { // collect all data assigned to each cluster List<Vector> data = new ArrayList<Vector>(); for (int i = 0; i < sampleData.size(); i++) if (z.get(i) == k) data.add(sampleData.get(i)); // add a new posterior model if any data else use a prior model if (data.size() > 0) newModels.add(modelFactory.sampleFromPosterior(data)); else newModels.add(modelFactory.sampleFromPrior()); } return newModels; } /** * Compute a new mixture based upon the Dirichlet distribution and the assignment of points to models. * * @param dir the DirichletDistribution, which is modified during this process * @param z the Vector of assignment of points to models * @return the new mixture Vector */ private Vector computeNewMixtureParameters(DirichletDistribution dir, Vector z) { Vector mixture; Vector a = new DenseVector(maxClusters); a.assign(alpha_0 / maxClusters); for (int i = 0; i < z.size(); i++) { int z_i = (int) z.get(i); a.set(z_i, a.get(z_i) + 1); } dir.setAlpha(a); mixture = dir.sample(); return mixture; }
Here's a workinprogress Dirichlet Process Clustering algorithm that Ted Dunning has been coaching me to write. It originated from an R prototype which he wrote and which I translated into Java using our nascent Vector package. Then Ted took the Java and refactored it to introduce better abstractions, as the R implementation had no objects and my reverseengineered abstractions were rather clunky. Finally, I got the new implementation working with a pluggable distributions framework based either upon commonsmath (+ Ted's patches thereto) or the blog0.2 framework (vanilla).
I am posting this to the list in hopes of generating more interest from the larger Mahout community. It has taken me literally months to wrap my mind around this approach. Enjoy.
To run this patch you will need to get the blog package at http://people.csail.mit.edu/milch/blog. Here is the beginning of the README file that came with the distribution:
=====
Bayesian Logic (BLOG) Inference Engine version 0.2
Copyright (c) 2007, Massachusetts Institute of Technology
Copyright (c) 2005, 2006, Regents of the University of California
All rights reserved. This software is distributed under the license
included in LICENSE.txt.
Lead author: Brian Milch, milch@csail.mit.edu
Supervisors: Prof. Stuart Russell (Berkeley), Prof. Leslie Kaelbling (MIT)
Contributors: Bhaskara Marthi, Andrey Kolobov, David Sontag, Daniel L. Ong,
Brendan Clark
=====
Jeff
Committed revision 754797. Committed code was refactored slightly from last patch (models moved from test to core), and pom was updated to reflect Gson dependency.