Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3
    • Component/s: Clustering
    • Labels:
      None

      Description

      Copied over from original issue:

      > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
      > non-parametric clustering algorithm.

      Attachments

      1. MAHOUT-30.patch
        33 kB
        Jeff Eastman
      2. MAHOUT-30b.patch
        45 kB
        Jeff Eastman
      3. MAHOUT-30c.patch
        72 kB
        Jeff Eastman
      4. MAHOUT-30d.patch
        66 kB
        Jeff Eastman
      5. MAHOUT-30e.patch
        70 kB
        Jeff Eastman
      6. dirichlet-2.tar
        690 kB
        Jeff Eastman
      7. screenshot-1.jpg
        149 kB
        Jeff Eastman
      8. MAHOUT-30f.patch
        143 kB
        Jeff Eastman

          Activity

          Jeff Eastman added a comment -

          Here's a work-in-progress Dirichlet Process Clustering algorithm that Ted Dunning has been coaching me to write. It originated from an R prototype which he wrote and which I translated into Java using our nascent Vector package. Then Ted took the Java and refactored it to introduce better abstractions, as the R implementation had no objects and my reverse-engineered abstractions were rather clunky. Finally, I got the new implementation working with a pluggable distributions framework based either upon commons-math (+ Ted's patches thereto) or the blog-0.2 framework (vanilla).

          I am posting this to the list in hopes of generating more interest from the larger Mahout community. It has taken me literally months to wrap my mind around this approach. Enjoy.

          To run this patch you will need to get the blog package at http://people.csail.mit.edu/milch/blog. Here is the beginning of the README file that came with the distribution:
          =====
          Bayesian Logic (BLOG) Inference Engine version 0.2

          Copyright (c) 2007, Massachusetts Institute of Technology
          Copyright (c) 2005, 2006, Regents of the University of California
          All rights reserved. This software is distributed under the license
          included in LICENSE.txt.

          Lead author: Brian Milch, milch@csail.mit.edu
          Supervisors: Prof. Stuart Russell (Berkeley), Prof. Leslie Kaelbling (MIT)
          Contributors: Bhaskara Marthi, Andrey Kolobov, David Sontag, Daniel L. Ong,
          Brendan Clark
          =====

          Jeff

          Jeff Eastman added a comment -

           I did some refactoring to better localize the major processing steps in DirichletCluster. The first refactoring was to create an iterate method that handles each iteration after the initialization phase has set up the models, the Dirichlet distribution and the mixture. Then I factored out the three sub-processing steps (assignPointsToModels, recomputeModels and computeNewMixtureParameters) and moved some of the instance variables to locals. Looking at the result, it is starting to feel more like the other clustering algorithms, so I'm starting to think in terms of an M/R partitioning.

          • In assignPointsToModels, we are iterating over all of the points, calculating their pdf with respect to the current clusters (models), applying the mixture probabilities and then using a multinomial sampling to assign them to a single cluster for this iteration. In a subsequent iteration we might assign them to different models, depending upon the probabilities.
          • In recomputeModels, we use the model assignments from this iteration to compute the posterior samples for each of the models. Then we generate a new model that best describes this sub-sample. The ah-ha for me here is that we create entirely new models in each iteration.
          • In iterate, we periodically output the models so we can gain whatever wisdom we wish from their parameters. I didn't factor that out but may later.
           • In computeNewMixtureParameters, we use the same model assignments to compute a new alpha vector for the Dirichlet distribution and a new mixture vector. Then we iterate over the points again, continually refining these values. The ah-ha for me here is that we do not actually remember the model assignments. I think, after the models have all been computed, we could run another pass over the points, computing the pdf for each cluster as a probability vector. We do this in Kmeans too.

           The first step seems very similar to the KmeansMapper: we read the Dirichlet, mixture and the current models into each mapper, then compute the multinomial and output the point keyed by its model index to the reducer. In a single reducer, all of the points are seen, so we can calculate the posterior statistics if we switch the model to use a running-sums (s0, s1, s2) algorithm for the std. Following this approach, we might be able to use a combiner to maintain the running sums (assuming Hadoop only runs it once for now), thus reducing the heavy load on the poor reducer. With a single reducer, I'm pretty sure we can compute the new mixture and Dirichlet parameters. With multiple reducers I think those calculations would need help combining the reducer values in the driver.
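           Below is a hedged sketch, not part of any patch here, of the running-sums idea for a single scalar value: s0, s1 and s2 can be accumulated independently, merged by a combiner, and only turned into a mean and std at the end. For Vector samples the sums would be kept componentwise; the class and method names are purely illustrative.

           class RunningSums {
             double s0; // number of observations
             double s1; // sum of the observations
             double s2; // sum of the squared observations

             void observe(double x) {
               s0 += 1;
               s1 += x;
               s2 += x * x;
             }

             // what a combiner (or the reducer) would do with partial sums
             void merge(RunningSums other) {
               s0 += other.s0;
               s1 += other.s1;
               s2 += other.s2;
             }

             double mean() {
               return s1 / s0;
             }

             double std() {
               // population standard deviation recovered from the running sums
               return Math.sqrt(s2 / s0 - (s1 / s0) * (s1 / s0));
             }
           }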

          Ted, I know you've thought about this a lot, do these steps sound reasonable?

          DirichletCluster.java
           /**
             * Perform one iteration of the clustering process, updating the Dirichlet distribution
             * and returning the new mixtures
             * 
             * @param models a List<Model<Vector>> of models
             * @param dir a DirichletDistribution that will be modified in each iteration
             * @param mixture a Vector mixture 
             * @param clusterSamples a List<List<Model<Vector>>> that will be modified in each iteration
             * @param iteration the int iteration number
             * @return a new mixture Vector
             */
            private Vector iterate(List<Model<Vector>> models, DirichletDistribution dir,
                Vector mixture, List<List<Model<Vector>>> clusterSamples, int iteration) {
              // z holds the assignment of cluster numbers to points for this iteration
              Vector z = assignPointsToModels(models, mixture);
              models = recomputeModels(z);
          
              // periodically add models to cluster samples after getting started
              if ((iteration > burnin) && (iteration % thin == 0))
                clusterSamples.add(models);
          
              // compute and return the new mixture parameters (adjusting 
              // the Dirichlet dir's alpha vector in the process)
              return computeNewMixtureParameters(dir, z);
            }
          
            /**
             * Assign all points to a model based upon the current mixture parameters and
             * the probability that each point is described by each model's pdf.
             * 
             * @param models the current List<Model<Vector>> of models
             * @param mixture the Vector of mixture probabilities
             * @return z the Vector of assignment of points to models
             */
            private Vector assignPointsToModels(List<Model<Vector>> models, Vector mixture) {
              // z holds the assignment of cluster numbers to points for this iteration
              Vector z = new DenseVector(sampleData.size());
          
              Vector pi = new DenseVector(maxClusters);
              for (int ix = 0; ix < sampleData.size(); ix++) {
                Vector x = sampleData.get(ix);
                // compute probabilities for each cluster, saving the largest
                double max = 0;
                for (int k = 0; k < maxClusters; k++) {
                  double p = mixture.get(k) * models.get(k).pdf(x);
                  pi.set(k, p);
                  if (max < p)
                    max = p;
                }
                // normalize the probabilities by largest observed value
                pi.assign(new TimesFunction(), 1.0 / max);
                // then pick one cluster by sampling a multinomial based upon the probabilities
                int model = dist.rmultinom(pi);
                z.set(ix, model);
              }
              return z;
            }
          
            /**
             * Recompute new models based upon the assignment of points to models and return them
             * 
             * @param z the Vector of assignment of points to models
             * @return the List<Model<Vector>> of models
             */
            private List<Model<Vector>> recomputeModels(Vector z) {
              List<Model<Vector>> newModels = new ArrayList<Model<Vector>>();
              for (int k = 0; k < maxClusters; k++) {
                // collect all data assigned to each cluster
                List<Vector> data = new ArrayList<Vector>();
                for (int i = 0; i < sampleData.size(); i++)
                  if (z.get(i) == k)
                    data.add(sampleData.get(i));
                // add a new posterior model if any data else use a prior model
                if (data.size() > 0)
                  newModels.add(modelFactory.sampleFromPosterior(data));
                else
                  newModels.add(modelFactory.sampleFromPrior());
              }
              return newModels;
            }
          
            /**
             * Compute a new mixture based upon the Dirichlet distribution and the assignment of points to models.
             * 
             * @param dir the DirichletDistribution, which is modified during this process
             * @param z the Vector of assignment of points to models
             * @return the new mixture Vector
             */
            private Vector computeNewMixtureParameters(DirichletDistribution dir, Vector z) {
              Vector mixture;
              Vector a = new DenseVector(maxClusters);
              a.assign(alpha_0 / maxClusters);
              for (int i = 0; i < z.size(); i++) {
                int z_i = (int) z.get(i);
                a.set(z_i, a.get(z_i) + 1);
              }
              dir.setAlpha(a);
              mixture = dir.sample();
              return mixture;
            }
          
          Ted Dunning added a comment -

          Jeff,

          These look like really nice refactorings. The process is nice and clear.

          The only key trick that may confuse people is that each step is a sampling. Thus assignment to clusters does NOT assign to the best cluster, it picks a cluster at random, biased by the mixture parameters and model pdf's. Likewise, model computation does NOT compute the best model, it samples from the distribution given by the data. Same is true for the mixture parameters.

          Your code does this. I just think that this is a hard point for people to understand in these techniques.
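           To make that concrete, here is a minimal sketch (not the patch code; names are illustrative) of sampling a cluster index in proportion to the unnormalized probabilities p[k] = mixture[k] * pdf[k], rather than taking the argmax:

           // pick index k with probability p[k] / sum(p); an argmax assignment
           // would instead always return the single largest entry
           static int sampleMultinomial(double[] p, java.util.Random random) {
             double total = 0;
             for (double v : p) {
               total += v;
             }
             double r = random.nextDouble() * total;
             for (int k = 0; k < p.length; k++) {
               r -= p[k];
               if (r <= 0) {
                 return k;
               }
             }
             return p.length - 1; // guard against floating-point round-off
           }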

           Jeff Eastman added a comment - edited

           I refactored again and was able to eliminate materializing the posterior data sets by adding observe() and computeParameters() operations to Model. The idea is that all models begin in their prior state and are asked to observe each sample that is assigned to them. Then, before pdf() is called on them in the next iteration, a call to computeParameters() finalizes the parameters once and turns the model into a posterior model. I also compute counts on the fly to eliminate materializing z altogether. I hope I didn't throw the baby out with the bath water.

           Finally, I introduced a DirichletState bean to hold the models, the Dirichlet distribution and the mixture, simplifying the arguments and, I think, fixing a bug in the earlier refactoring. The algorithm runs over 10,000 points and produces the following outputs (prior() indicates a model with no observations, n is the number of observations, m the mean and sd the std):

          Generating 4000 samples m=[1.0, 1.0] sd=3.0
          Generating 3000 samples m=[1.0, 0.0] sd=0.1
          Generating 3000 samples m=[0.0, 1.0] sd=0.1

          • sample[0]= [prior(), normal(n=6604 m=[0.67, 0.63] sd=1.11), normal(n=86 m=[0.77, 2.81] sd=2.15), prior(), normal(n=242 m=[2.89, 1.67] sd=2.14), normal(n=2532 m=[0.53, 0.55] sd=0.69), normal(n=339 m=[0.99, 1.70] sd=2.18), normal(n=77 m=[0.53, 0.47] sd=0.51), normal(n=119 m=[0.36, 0.47] sd=2.85), normal(n=1 m=[0.00, 0.00] sd=0.33)]
          • sample[1]= [prior(), normal(n=6626 m=[0.62, 0.54] sd=0.91), normal(n=137 m=[0.51, 2.99] sd=1.56), normal(n=2 m=[0.57, 0.25] sd=0.70), normal(n=506 m=[2.55, 0.93] sd=1.73), normal(n=1573 m=[0.38, 0.60] sd=0.50), normal(n=848 m=[0.81, 1.59] sd=2.11), normal(n=67 m=[0.76, 0.31] sd=0.45), normal(n=240 m=[0.73, 0.31] sd=2.24), normal(n=1 m=[0.00, 0.00] sd=0.98)]
          • sample[2]= [prior(), normal(n=5842 m=[0.67, 0.39] sd=0.73), normal(n=157 m=[0.73, 3.12] sd=1.14), prior(), normal(n=655 m=[2.32, 0.64] sd=1.60), normal(n=1439 m=[0.00, 1.00] sd=0.33), normal(n=1439 m=[0.78, 1.53] sd=1.89), normal(n=66 m=[0.96, -0.04] sd=0.24), normal(n=399 m=[0.63, -0.03] sd=1.99), normal(n=3 m=[-0.07, 0.76] sd=0.41)]
          Model
          /**
           * A model is a probability distribution over observed data points and allows 
           * the probability of any data point to be computed.
           */
          public interface Model<Observation> {
            
            /**
             * Observe the given observation, retaining information about it
             * 
             * @param x an Observation from the posterior
             */
            public abstract void observe(Observation x);
            
            /**
             * Compute a new set of posterior parameters based upon the Observations 
             * that have been observed since my creation
             */
            public abstract void computeParameters();
          
            /**
            * Return the probability that the observation is described by this model
            * 
            * @param x an Observation from the posterior
            * @return the probability that x is in z
            */
            public abstract double pdf(Observation x);
          }
          
          DirichletCluster
            /**
             * Initialize the variables and run the iterations to assign the sample data
             * points to a computed number of clusters
             *
             * @return a List<List<Model<Observation>>> of the observed models
             */
            public List<List<Model<Observation>>> dirichletCluster() {
              DirichletState<Observation> state = initializeState();
          
              // create a posterior sample list to collect results
              List<List<Model<Observation>>> clusterSamples = new ArrayList<List<Model<Observation>>>();
          
              // now iterate
              for (int iteration = 0; iteration < maxIterations; iteration++)
                iterate(state, iteration, clusterSamples);
          
              return clusterSamples;
            }
          
            /**
             * Initialize the state of the computation
             * 
             * @return the DirichletState
             */
            private DirichletState<Observation> initializeState() {
              // get initial prior models
              List<Model<Observation>> models = createPriorModels();
              // create the initial distribution.
              DirichletDistribution distribution = new DirichletDistribution(maxClusters,
                  alpha_0, dist);
              // mixture parameters are sampled from the Dirichlet distribution. 
              Vector mixture = distribution.sample();
              return new DirichletState<Observation>(models, distribution, mixture);
            }
          
            /**
             * Create a list of prior models
             * @return the List<Model<Observation>> of prior models
             */
            private List<Model<Observation>> createPriorModels() {
              List<Model<Observation>> models = new ArrayList<Model<Observation>>();
              for (int k = 0; k < maxClusters; k++) {
                models.add(modelFactory.sampleFromPrior());
              }
              return models;
            }
          
            /**
             * Perform one iteration of the clustering process, updating the state for the next iteration
             * @param state the DirichletState<Observation> of this iteration
             * @param iteration the int iteration number
             * @param clusterSamples a List<List<Model<Observation>>> that will be modified in each iteration
             */
            private void iterate(DirichletState<Observation> state, int iteration,
                List<List<Model<Observation>>> clusterSamples) {
          
              // create new prior models
              List<Model<Observation>> newModels = createPriorModels();
          
              // initialize vector of membership counts
              Vector counts = new DenseVector(maxClusters);
              counts.assign(alpha_0 / maxClusters);
          
              // iterate over the samples
              for (Observation x : sampleData) {
                // compute vector of probabilities x is described by each model
                Vector pi = computeProbabilities(state, x);
                // then pick one cluster by sampling a Multinomial distribution based upon them
                // see: http://en.wikipedia.org/wiki/Multinomial_distribution
                int model = dist.rmultinom(pi);
                // ask the selected model to observe the datum
                newModels.get(model).observe(x);
                // record counts for the model
                counts.set(model, counts.get(model) + 1);
              }
          
              // compute new model parameters based upon observations
              for (Model<Observation> m : newModels)
                m.computeParameters();
          
              // update the state from the new models and counts
              state.distribution.setAlpha(counts);
              state.mixture = state.distribution.sample();
              state.models = newModels;
          
              // periodically add models to cluster samples after getting started
              if ((iteration > burnin) && (iteration % thin == 0))
                clusterSamples.add(state.models);
            }
          
            /**
             * Compute a normalized vector of probabilities that x is described
             * by each model using the mixture and the model pdfs
             * 
             * @param state the DirichletState<Observation> of this iteration
             * @param x an Observation
             * @return the Vector of probabilities
             */
            private Vector computeProbabilities(DirichletState<Observation> state,
                Observation x) {
              Vector pi = new DenseVector(maxClusters);
              double max = 0;
              for (int k = 0; k < maxClusters; k++) {
                double p = state.mixture.get(k) * state.models.get(k).pdf(x);
                pi.set(k, p);
                if (max < p)
                  max = p;
              }
              // normalize the probabilities by largest observed value
              pi.assign(new TimesFunction(), 1.0 / max);
              return pi;
            }
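           As a concrete illustration of the observe()/computeParameters() lifecycle described above, here is a hedged sketch of what one normal model might look like, using double[] points, running sums and a spherical Gaussian. It is not the model class in this patch; the names and the prior handling are illustrative only.

           // Starts as a prior (priorMean, priorSd), accumulates running sums via
           // observe(), and becomes a posterior when computeParameters() is called.
           class SketchNormalModel {
             private final int dim;
             private final double[] mean; // current mean, initially the prior mean
             private double sd;           // current std, initially the prior sd
             private double s0;           // observation count
             private final double[] s1;   // componentwise sums
             private double s2;           // sum of squared norms

             SketchNormalModel(double[] priorMean, double priorSd) {
               this.dim = priorMean.length;
               this.mean = priorMean.clone();
               this.sd = priorSd;
               this.s1 = new double[dim];
             }

             void observe(double[] x) {
               s0++;
               for (int j = 0; j < dim; j++) {
                 s1[j] += x[j];
                 s2 += x[j] * x[j];
               }
             }

             void computeParameters() {
               if (s0 == 0) {
                 return; // no observations: remain a prior model
               }
               double meanSq = 0;
               for (int j = 0; j < dim; j++) {
                 mean[j] = s1[j] / s0;
                 meanSq += mean[j] * mean[j];
               }
               // spherical variance from the running sums; keep the prior sd when
               // there are too few observations for a meaningful estimate
               double var = s2 / (s0 * dim) - meanSq / dim;
               if (s0 > 1 && var > 0) {
                 sd = Math.sqrt(var);
               }
             }

             double pdf(double[] x) {
               double sq = 0;
               for (int j = 0; j < dim; j++) {
                 double d = x[j] - mean[j];
                 sq += d * d;
               }
               return Math.exp(-sq / (2 * sd * sd))
                   / Math.pow(Math.sqrt(2 * Math.PI) * sd, dim);
             }
           }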
          
          Jeff Eastman added a comment -

          This patch is a complete implementation of a non-M/R Dirichlet Process clustering algorithm. It has been refactored significantly from the above. I will describe these changes in a subsequent posting. I'd like to commit this to trunk in time for 0.1 if people agree.

          Jeff Eastman added a comment -

           This patch makes several improvements to the previous version:

          • refactored state updating and cluster sampling into DirichletState
          • refactored creating list of models into ModelDistribution
          • refactored state parameters from DirichletCluster to DirichletState
          • refactored count into the model
           • changed List<Model> to Model[]
           • added significance filtering to the printout
           • increased the number of iterations to 30 to demonstrate better convergence

          The algorithm now produces the following output when run over 10,000 points:

          • Using fixed random seed for repeatability.
          • testDirichletCluster10000
          • Generating 4000 samples m=[1.0, 1.0] sd=3.0
          • Generating 3000 samples m=[1.0, 0.0] sd=0.1
          • Generating 3000 samples m=[0.0, 1.0] sd=0.1
          • sample[0]= normal(n=4037 m=[0.80, 0.73] sd=1.40), normal(n=3844 m=[0.51, 0.51] sd=0.68), normal(n=1092 m=[0.51, 0.47] sd=0.53), normal(n=794 m=[1.26, 1.60] sd=2.22),
          • sample[1]= normal(n=4562 m=[0.72, 0.68] sd=1.25), normal(n=2992 m=[0.48, 0.52] sd=0.58), normal(n=1022 m=[0.67, 0.31] sd=0.53), normal(n=1227 m=[1.17, 1.41] sd=2.13),
          • sample[2]= normal(n=4377 m=[0.66, 0.61] sd=1.08), normal(n=2592 m=[0.28, 0.71] sd=0.51), normal(n=1057 m=[1.04, -0.06] sd=0.25), normal(n=1831 m=[1.15, 1.26] sd=2.05),
          • sample[3]= normal(n=4302 m=[0.74, 0.36] sd=0.80), normal(n=2075 m=[-0.00, 1.01] sd=0.32), normal(n=793 m=[1.04, -0.05] sd=0.20), normal(n=2694 m=[1.04, 1.17] sd=1.93),
          • sample[4]= normal(n=3602 m=[0.80, 0.21] sd=0.58), normal(n=1923 m=[-0.05, 1.05] sd=0.26), normal(n=621 m=[1.03, -0.06] sd=0.19), normal(n=3677 m=[0.94, 1.09] sd=1.77),
          Jeff Eastman added a comment -

          This patch fixes a randomization problem caused by using two Random instances and adds several display tests that were used to generate the examples in the wiki.

          Isabel Drost-Fromm added a comment -

          First of all: Great work, Jeff. I finally found some time to have a closer look at the code. To me it already looks pretty clear and easy to understand. Some minor comments:

          I did not have a close look at the code for displaying the clustering process so far. If it is to be retained in the final version it might be a good idea to move that into its own package?

           I was wondering why you wrapped two math packages (blog and commons-math). Maybe it would help if you briefly named the advantages and shortcomings of each? I was also missing a pointer to Ted's patch to commons-math. Maybe I just overlooked it?

           In the patch I am missing changes to the dependencies in the pom.xml. I guess you would check the libraries the patch depends on into our libs directory? At first sight the license of BLOG as well as its dependencies seems fine; has anyone else verified this?

           Personally, I would love to see the mathematical formulae behind the implementation in the docs as well, or maybe a pointer to a book chapter, publication or other source that explains the algorithm in more detail.

          Looking through the code, I found it a little irritating to have classes ModelDistribution, NormalModelDistribution, and DirichletDistribution around. As the only method in the interface ModelDistribution is sampleFromPrior, it might be clearer if it were named ModelSampler? But maybe it is just the time of day I looked at it...

          Jeff Eastman added a comment -

          Hi Isabel,

          I'm so happy you had time to look through the code. Getting it to this
          point was a great ordeal for me as the math is complicated and I have no
          formal statistics background. Ted's help was critical in getting me to
          the tipping point where I now understand the implementation well enough
          to make progress on my own. I'm getting ready for a week vacation and
          will not have email but would love to continue this dialog and am very
          open to your suggestions below. See more comments therein.

          Jeff

           > I did not have a close look at the code for displaying the clustering process so far. If it is to be retained in the final version it might be a good idea to move that into its own package?

           I was thinking of moving the display code into the examples directory.

           > I was wondering why you wrapped two math packages (blog and commons-math). Maybe it would help if you briefly named the advantages and shortcomings of each?

           I did that so Ted could use his favorite library but he has not been pursuing it. I'm happy with blog and, as commons does not have the needed sampling methods without Ted's patches, suggest we could go with blog. Removing the pluggability would clean up the code some too.

           > In the patch I am missing changes to the dependencies in the pom.xml. I guess you would check the libraries the patch depends on into our libs directory?

           Does this relate to maven?

           > Personally, I would love to see the mathematical formulae behind the implementation in the docs as well ...

           Boy, I would too. Especially if it was clear enough that I could understand it.

           > As the only method in the interface ModelDistribution is sampleFromPrior, it might be clearer if it were named ModelSampler?

           Some of those were terms Ted introduced from my original port of his R example. I'm not hung up but perhaps we should include him in the discussion?

          Ted Dunning added a comment -

           Regarding the question of whether something should be called a Distribution or a Sampler, the mathematical terminology is that a distribution is something you can sample, so the Distribution terminology would be most compatible that way. The fact that only one method is currently defined is likely a temporary thing ... other methods could well be required for later efforts.

          On Fri, Dec 26, 2008 at 10:53 AM, Jeff Eastman (JIRA) <jira@apache.org> wrote:

          ... Some of those were terms Ted introduced from my original port of his R
          example. I'm not hung up but perhaps we should include him in the
          discussion?

           Ted Dunning added a comment - edited

          One good reference is this relatively dense article by McCullagh and Yang:

          http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf

          There is also a more approachable example in Chris Bishop's book on Machine Learning. See http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm. I think that chapter 9 is where the example of clustering using a mixture model is found.

           The Neal and Blei references from the McCullagh and Yang paper are also good. Zoubin Ghahramani has some very nice tutorials out which describe why non-parametric Bayesian approaches to problems are very cool. One is at http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf, but there are video versions available as well.

          Isabel Drost-Fromm added a comment -

          > Regarding the question of whether something should be called a Distribution or a Sampler, the mathematical terminology is that a distribution is
          > something you can sample so the the Distribution terminology would be most compatible that way. The fact that only one method is currently
          > defined is likely a temporary thing ... other methods could well be required for later efforts.

           I understand. I do not have any strong objections. I think a short class comment in DirichletDistribution would already help to avoid at least my confusion. (Although not trying to understand code at 2 a.m. local time might help as well...)

          > I was thinking of moving the display code into the examples directory.

          Sounds like a great idea to me.

           > I did that so Ted could use his favorite library but he has not been pursuing it. I'm happy with blog and, as commons does not have the
           > needed sampling methods without Ted's patches, suggest we could go with blog. Removing the pluggability would clean up the code some too.

           Do you know what the current status of those patches is? I must admit I have a slight preference for commons-math as well, provided it supports what we need.

          Isabel Drost-Fromm added a comment -

          Ted: I am moving the references over into the wiki page. Hope that is fine with you?

          Ted Dunning added a comment -

          That would be great. The wiki is really a better place.

          Jeff Eastman added a comment -

           This patch moves the display-related classes into the examples subtree and removes the dependency upon commons-math, which needs some extra patches to the release and is TBD. It still depends upon blog-0.3, which I have only been able to find in source form. I will try to locate or build a suitable jar for that. I believe that, since it is under a BSD license, we are OK, but I'd like confirmation. The blog site is at http://people.csail.mit.edu/milch/blog/software.html.

          Jeff Eastman added a comment -

           This patch is completely self-contained as it depends only upon the uncommons.math package for its distributions. I think this is ready to commit to trunk, after the release of course. The next step will be to implement an M/R algorithm using this basic approach.

          Jeff Eastman added a comment -

          Here's the patch file

          Jeff Eastman added a comment -

          This file contains a tar of a standalone Eclipse project named Dirichlet Cluster. The directory structure is self-contained and only depends upon the Mahout project. It uses the Gson beta1.3 jar file for serialization/deserialization and that is included.

          This version contains an initial Hadoop implementation that has been tested through 10 iterations. In each iteration, clusters are read from the previous iteration (in the state-i directories) and points are assigned to clusters in the Mapper. The reducer then observes all points for each cluster and computes new clusters which are output for the subsequent iteration. The unit test TestMapReduce.testDriverMRIterations() creates 400 data points and runs the MR Driver. Then it gathers all of the state files and summarizes them on the console.
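           The sketch below only illustrates the mapper/reducer split described above; it is not the code in this tar (which reads and writes Gson-serialized state), the class names are made up, and it uses a hard-coded toy state and a mean-only reducer so that it stays self-contained.

           import java.io.IOException;
           import java.util.Random;

           import org.apache.hadoop.io.IntWritable;
           import org.apache.hadoop.io.LongWritable;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.mapreduce.Mapper;
           import org.apache.hadoop.mapreduce.Reducer;

           // Mapper: sample a cluster for each "x,y" point from the previous
           // iteration's mixture and models, then key the point by that cluster.
           class DirichletSketchMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
             private final Random random = new Random();
             private double[][] means;  // per-cluster means, normally read from state-(i-1)
             private double[] mixture;  // mixture weights, normally read from state-(i-1)

             @Override
             protected void setup(Context context) {
               // toy values keep the sketch self-contained; the real job would load
               // the prior iteration's state here
               means = new double[][] { { 0, 0 }, { 2, 2 } };
               mixture = new double[] { 0.5, 0.5 };
             }

             @Override
             protected void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
               String[] parts = value.toString().split(",");
               double[] x = { Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) };
               // unnormalized probability of each cluster: mixture * spherical-normal pdf
               double[] p = new double[means.length];
               double total = 0;
               for (int k = 0; k < means.length; k++) {
                 double dx = x[0] - means[k][0];
                 double dy = x[1] - means[k][1];
                 p[k] = mixture[k] * Math.exp(-(dx * dx + dy * dy) / 2);
                 total += p[k];
               }
               // sample a cluster in proportion to p rather than taking the argmax
               double r = random.nextDouble() * total;
               int cluster = 0;
               while (cluster < p.length - 1 && (r -= p[cluster]) > 0) {
                 cluster++;
               }
               context.write(new IntWritable(cluster), value);
             }
           }

           // Reducer: observe every point assigned to a cluster and emit its new mean;
           // the real reducer would also recompute the std and write a new state-i file.
           class DirichletSketchReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
             @Override
             protected void reduce(IntWritable cluster, Iterable<Text> points, Context context)
                 throws IOException, InterruptedException {
               double s0 = 0;
               double sx = 0;
               double sy = 0;
               for (Text point : points) {
                 String[] parts = point.toString().split(",");
                 s0++;
                 sx += Double.parseDouble(parts[0]);
                 sy += Double.parseDouble(parts[1]);
               }
               context.write(cluster, new Text(sx / s0 + "," + sy / s0));
             }
           }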

          I noticed when I replaced the beta distribution code earlier that the clustering now tends to put everything into the same cluster. I'm suspicious about the beta values that are being computed and need to investigate this further.

          I think this design will allow an arbitrary number of Mappers and Reducers up to the number of clusters. There is a stub Combiner class that is not currently used. I will continue to develop unit tests but I wanted to get this into view because it is a real first light MR implementation.

          Jeff

          Jeff Eastman added a comment -
           • fixed a bug in rBeta where the arguments to rGamma were backwards
           • fixed the model std calculations to handle the single-sample case (s0=1 => std==0.0, which made the resulting pdf==NaN and horked all subsequent cluster assignments)
           • added a cluster history display to the examples, with colors to illustrate cluster convergence

          This is behaving pretty nicely now
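           For illustration only (not the patch code; names are made up), the single-sample guard amounts to something like this: with only one observation the running-sums variance is exactly zero, so the std must not be allowed to collapse to 0.0 or the normal pdf returns NaN:

           // fall back to a prior (nonzero) sd when there are too few observations
           // for a meaningful variance estimate
           static double safeStd(double s0, double s1, double s2, double priorSd) {
             if (s0 <= 1) {
               return priorSd; // 0 or 1 samples: variance would be 0 (or undefined)
             }
             double var = s2 / s0 - (s1 / s0) * (s1 / s0);
             return var > 0 ? Math.sqrt(var) : priorSd;
           }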

          Jeff Eastman added a comment -

          generateSamples(500, 0, 0, 0.5);
          generateSamples(500, 2, 0, 0.2);
          generateSamples(500, 0, 2, 0.3);
          generateSamples(500, 2, 2, 1);
          DirichletDriver.runJob("input", "output",
          "org.apache.mahout.clustering.dirichlet.SampledNormalDistribution", 20, 15, 1.0, 2);

          Jeff Eastman added a comment -

           Final patch file is ready to commit. I need to add an entry to the pom for the Gson jar, which is currently in core/lib.

          Jeff Eastman added a comment -

           Committed revision 754797. The committed code was refactored slightly from the last patch (models were moved from test to core), and the pom was updated to reflect the Gson dependency.


            People

             • Assignee: Jeff Eastman
             • Reporter: Isabel Drost-Fromm
             • Votes: 0
             • Watchers: 1
