Mahout / MAHOUT-4

Simple prototype for Expectation Maximization (EM)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Create a simple prototype implementing Expectation Maximization (EM) that demonstrates the algorithm's functionality given a set of (user, click-url) data.
      The prototype should be functionally complete and should serve as a basis for the Map-Reduce version of the EM algorithm.

      Attachments

      1. dp-cluster.r (5 kB, Ted Dunning)


          Activity

          Ankur added a comment -

          Here is the prototype implementation of Probabilistic Latent Semantic Indexing (PLSI) that uses Expectation Maximization. Please refer to the javadoc comments for an explanation.

          Feel free to experiment with the code and have fun

          Grant Ingersoll added a comment -

          Hi Ankur,

          I tried applying the patch, but I think it has some errors in it. It appears the engine has been added twice????

          Can you generate a new patch?

          Ankur added a comment -

          Oops! Looks like my Subversive Eclipse plugin did something wacky while generating the patch. Sorry for that. Please find the rectified patch file. Hope this goes through fine.

          Ankur added a comment -

          Hi Grant,
          Were you able to get the patch working? Are there any improvements/changes required?

          Isabel Drost-Fromm added a comment -

          Hello Ankur,

          I checked out the patch yesterday. I was able to successfully apply the patch to the source in svn, so it seems to have arrived fine. I will take a closer look at the code this weekend.

          As far as I know EM from clustering tasks, it should be possible to port the algorithm to a Hadoop setting in a way similar to the k-means implementation, right?

          Ankur added a comment -

          Hi Isabel,
          The algorithm sure can be ported to a Map-Reduce setting on Hadoop. In fact, the algorithm has already been map-reduced, as mentioned in the Google News personalization paper (please see the Javadoc for details).

          I wrote the non-distributed version of the algorithm to help myself understand, visualize and see the EM algorithm in action, starting with a very small dataset. The iterative logic and small dataset particularly help in seeing how the probability values of users and items belonging to a cluster converge for users sharing a large number of common items.

          I also have a fair idea of how to map-reduce it. Once the prototype is accepted and the features/changes desirable in the map-reduce implementation are suggested, it shouldn't take me long to contribute the distributed version.

          Isabel Drost-Fromm added a comment -

          Your plan of first trying to understand the non-distributed version and then map-reducing the algorithm sounds great. Some comments from my point of view:

          Maybe you might want to choose more verbose variable names than u, s and z, and provide the mapping to the names used in the paper in a comment. That should make it easier for readers of your code to distinguish users, stories and clusters (z).

          I think you might want to inline the initialize() method. For me personally this makes it easier to follow what is done in the constructors. As for the default constructor, you could simply delegate initialization to PLSI_engine(u, s, z) by passing the default values for initialization.
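          For illustration, a minimal sketch of the delegation suggested above; the fields, parameter names and default sizes below are assumptions for the sketch, not taken from the attached patch:

              // Minimal sketch of the suggested constructor delegation; fields and
              // default sizes are illustrative assumptions, not the attached patch.
              public class PLSI_engine {

                  private final double[][] pStoryGivenCluster; // p(s|z)
                  private final double[][] pClusterGivenUser;  // p(z|u), illustrative parameterization
                  private final double[] pCluster;             // p(z)

                  // The default constructor simply delegates with hypothetical default sizes.
                  public PLSI_engine() {
                      this(10, 20, 2);
                  }

                  public PLSI_engine(int u, int s, int z) {
                      pClusterGivenUser = new double[u][z];
                      pStoryGivenCluster = new double[z][s];
                      pCluster = new double[z];
                      // normalized random initialization of the parameters would go here
                  }
              }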

          Concerning the method calculate P_z_u_s: how many clusters do you expect? It seems like this computation could become numerically unstable for very large numbers of clusters.
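          For reference, one common way to keep such a normalization stable is to work in log space using the log-sum-exp trick; a minimal sketch, with method and variable names that are purely illustrative and not from the attached patch:

              // Sketch of the log-sum-exp trick for normalizing a posterior over many clusters.
              // logJoint[z] is assumed to hold something like log(p(s|z) * p(z|u)); illustrative only.
              public final class LogSpaceNormalizer {

                  static double[] normalize(double[] logJoint) {
                      double max = Double.NEGATIVE_INFINITY;
                      for (double v : logJoint) {
                          max = Math.max(max, v);
                      }
                      double sum = 0.0;
                      for (double v : logJoint) {
                          sum += Math.exp(v - max);        // shifting by the max keeps exp() from overflowing
                      }
                      double logNormalizer = max + Math.log(sum);
                      double[] posterior = new double[logJoint.length];
                      for (int z = 0; z < logJoint.length; z++) {
                          posterior[z] = Math.exp(logJoint[z] - logNormalizer);  // normalized, sums to 1
                      }
                      return posterior;
                  }
              }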

          It would be nice if you could provide some unit tests to prove that your code is working correctly.

          I know EM as a rather general principle; your implementation seems rather focussed on the setup of the Google News clustering solution. I was wondering whether it would be possible to generalize the implementation a little but still support the news personalization use case. Maybe others would like to reuse a general EM framework but not the exact same formulas that you used. I don't know whether that is possible and whether it can be done in a way that is easy to read...

          Ankur added a comment -

          Thanks for your comment. A few of my replies below:
          > Maybe you might ..
          Will make these changes in the next patch update.

          > ... - how many cluster numbers do you expect ...?
          Well, typically I would expect a user:cluster ratio of 1000:1, so for 1 million users, 1000 clusters would be created.

          In the main method, a sample user-story matrix is provided that can be changed for experimentation. However, if required, I can write a small unit test case that produces a randomly generated user-story matrix, but I am not sure that would help much.

          > I know EM as ...
          I like the idea of a general EM framework. I will definitely try to change the code so that it reflects EM more generically, as suggested.

          Ankur added a comment -

          Hi,
          So after scratching my head for the past couple of days and understanding EM from a general perspective, I have built a mental model for EM in the context of clustering.

          I thought it would be better to share my understanding and get input before turning out the code.

          So here is a short write-up in my words; please feel free to fill in any gaps/errors found.

          Expectation Maximization for clustering
          -----------------------------------------------------
          Let
          z = unobserved data, clusters in our case.
          y = observed data, points in our case.

          We start by initializing the model parameters p(y|z) and p(z) to appropriately normalized random values.
          For example, let the number of points = 4 and the number of clusters = 2; then for cluster z1:

          p(y1|z1) + p(y2|z1) + p(y3|z1) + p(y4|z1) = 1
          p(z1) + p(z2) = 1

          E-Step
          ------
          Calculate the posterior estimates of the probabilities for the various values of the unknown z as follows:

          p(z|y) = p(y,z) / p(y) = p(y|z) * p(z) / summation over z { p(y|z) * p(z) }

          Calculate the expected value of the log likelihood as follows:

          Q = summation over z { p(z|y) * log(p(y,z)) }

          Q remains unchanged if the newly calculated value equals the previous value of Q. The algorithm terminates when no improvement in Q is seen for all y.

          Use the p(y,z) estimates from the E-Step in the M-Step to re-compute the model parameters p(y|z) and p(z).

          M-Step
          ------
          p(y|z) = p(y,z) / summation over y { p(y,z) }        (p(y,z) taken from the E-Step)

          p(z) = summation over y { p(y,z) }        (p(y,z) taken from the E-Step)

          Questions
          =========
          1. When and how do we re-compute the cluster centers?

          2. As per my understanding, points and clusters are simply labels with some conditional probability assigned to them. A distance metric like the one used in K-means is nowhere involved; is that correct?
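
          For reference, a compact Java sketch of one round of the E-Step and M-Step written above, for a discrete observed variable y and hidden cluster z. The empirical distribution pHatY (count of y divided by the total number of observations) is an assumption the write-up leaves implicit, and all names are illustrative:

              // One EM round for the formulas above: E-step posteriors p(z|y), then M-step updates.
              public final class SimpleEm {

                  static void emRound(double[][] pYgivenZ, double[] pZ, double[] pHatY) {
                      int numZ = pZ.length;
                      int numY = pHatY.length;
                      double[][] joint = new double[numY][numZ];   // E-step estimate of p(y,z)

                      // E-step: p(z|y) = p(y|z)*p(z) / summation over z' { p(y|z')*p(z') },
                      // then p(y,z) = p(z|y) * pHat(y)
                      for (int y = 0; y < numY; y++) {
                          double norm = 0.0;
                          for (int z = 0; z < numZ; z++) {
                              norm += pYgivenZ[z][y] * pZ[z];
                          }
                          for (int z = 0; z < numZ; z++) {
                              joint[y][z] = (pYgivenZ[z][y] * pZ[z] / norm) * pHatY[y];
                          }
                      }

                      // M-step: p(z) = summation over y { p(y,z) };
                      //         p(y|z) = p(y,z) / summation over y { p(y,z) }
                      for (int z = 0; z < numZ; z++) {
                          double sumOverY = 0.0;
                          for (int y = 0; y < numY; y++) {
                              sumOverY += joint[y][z];
                          }
                          pZ[z] = sumOverY;
                          for (int y = 0; y < numY; y++) {
                              pYgivenZ[z][y] = joint[y][z] / sumOverY;
                          }
                      }
                  }
              }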

          Isabel Drost-Fromm added a comment -

          Adding the comments sent to the list here as well for further reference.

          > ----------------------------
          > So here is a short write-up in my words, please feel free to
          > fill any gaps/errors found

          I will try to do so from my perspective, maybe others can add their views.

          > Expectation Maximization for clustering
          > -----------------------------------------------------
          > Let
          > z = unobserved data, clusters in our case.
          > y = observed data, points in our case.
          >
          > p(y1|z1) + p(y2|z1) + p(y3|z1) + p(y4|z1) = 1
          > p(z1) + p(z2) = 1

          Looks correct to me.

          > E-Step.
          > ------
          > M-Step
          > ------

          I could not find an error in either of the two steps so far.

          > Questions
          > =========
          > 1. When and how do we re-compute the cluster centers ?

          EM does not work with explicit cluster centers. In k-means you iterate two steps: assigning points to centers and recomputing the centers. In EM you again iterate two steps: first computing, for each point, the probabilities of belonging to the clusters (so you do not assign it hard to one cluster; you only say with which probability it belongs to each of clusters 1 to k), and then recomputing the parameters of each cluster. The cluster center is influenced by every point, but only weighted by that point's probability of belonging to the cluster.
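
          To make the "weighted by its probability" point concrete, a small Java sketch of how the cluster parameters (here: centers, as in a Gaussian-style mixture) are recomputed from soft assignments; the variable names are illustrative and not taken from any Mahout code:

              // Recompute cluster centers as responsibility-weighted means.
              // resp[n][k] = p(cluster k | point n) from the E-step.
              public final class SoftCenters {

                  static double[][] recomputeCenters(double[][] points, double[][] resp, int numClusters) {
                      int dim = points[0].length;
                      double[][] centers = new double[numClusters][dim];
                      double[] totalResp = new double[numClusters];

                      for (int n = 0; n < points.length; n++) {
                          for (int k = 0; k < numClusters; k++) {
                              totalResp[k] += resp[n][k];
                              for (int d = 0; d < dim; d++) {
                                  centers[k][d] += resp[n][k] * points[n][d];
                              }
                          }
                      }
                      for (int k = 0; k < numClusters; k++) {
                          for (int d = 0; d < dim; d++) {
                              centers[k][d] /= totalResp[k];   // every point contributes, weighted softly
                          }
                      }
                      return centers;
                  }
              }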

          > 2. As per my understanding points and clusters are simply labels with some
          > conditional probability assigned to them. A distance metric like one
          > used in K-means is nowhere involved, is that correct ?

          Yes and no: technically no, but conceptually the computation of the probability of assigning a point to a cluster should be based on the point's distance to the cluster.

          I hope I did not cause more confusion than help. Maybe others can correct me or clarify what I left unclear...

          Isabel

          Ted Dunning added a comment -

          EM clustering is very seriously prone to over-fitting if you give reasonable flexibility to the clusters.

          An important adjustment is to put a reasonable prior on the distributions being mixed. This serves as regularization that helps avoid the problem. K-means (sort of) avoids the problem by assuming all clusters are symmetric with identical variance.
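
          As one simple illustration of such a prior (a sketch only, with an illustrative alpha and method name, not a prescription): the M-step estimate of a multinomial parameter can be smoothed with Dirichlet pseudo-counts, giving a MAP-style regularized estimate instead of the bare maximum-likelihood ratio.

              // Smooth an M-step multinomial estimate with symmetric pseudo-counts (alpha > 0).
              public final class SmoothedMStep {

                  static double[] estimateWithPrior(double[] expectedCounts, double alpha) {
                      double total = 0.0;
                      for (double c : expectedCounts) {
                          total += c + alpha;                         // alpha pseudo-counts per outcome
                      }
                      double[] p = new double[expectedCounts.length];
                      for (int i = 0; i < expectedCounts.length; i++) {
                          p[i] = (expectedCounts[i] + alpha) / total; // regularized estimate, sums to 1
                      }
                      return p;
                  }
              }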

          EM clustering can also be changed very slightly by assigning points to single clusters chosen at random according to the probability of membership. This turns EM clustering into Gibbs sampling. The important property that changes is that you can now sample over the distribution of possible assignments, which can be very important if some parts of your data are well defined and some parts are not so well defined.
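
          A sketch of the small change described above: instead of keeping the full soft posterior, draw a single cluster index from the membership probabilities (names are illustrative):

              // Sample one cluster assignment from the membership probabilities,
              // turning the soft E-step into the hard, randomized assignment described above.
              import java.util.Random;

              public final class ClusterSampler {

                  static int sampleCluster(double[] membershipProb, Random rng) {
                      double u = rng.nextDouble();
                      double cumulative = 0.0;
                      for (int k = 0; k < membershipProb.length; k++) {
                          cumulative += membershipProb[k];
                          if (u < cumulative) {
                              return k;                          // chosen with probability membershipProb[k]
                          }
                      }
                      return membershipProb.length - 1;          // guard against floating-point round-off
                  }
              }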

          Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly) non-parametric clustering algorithm. I will attach an R implementation for reference.

          Ted Dunning added a comment -


          R implementation of Dirichlet Process based clustering.

          Isabel Drost-Fromm added a comment -

          > An important adjustment is to put a reasonable prior on the distributions
          > being mixed. This serves as regularization that helps avoid the problem.
          > K-means (sort of) avoids the problem by assuming all clusters are symmetric
          > with identical variance.

          I think you could impose the same restriction on EM as well?

          > EM clustering can also be changed very slightly by assigning points to
          > single clusters chosen at random according to the probability of
          > membership. This turns EM clustering into Gibbs sampling.

          That is the simplest explanation of Gibbs sampling I have read so far.

          > Further extension can also be made by assuming an infinite mixture model.
          > The implementation is only slightly more difficult and the result is a
          > (nearly) non-parametric clustering algorithm. I will attach an R
          > implementation for reference.

          I think Dirichlet process based clustering comes with the handy property that you can avoid passing the number of clusters into the algorithm, right? To me that seems better for realistic settings where you usually have some data available but cannot tell how many clusters there are.

          Do you think one can solve the original PLSI problem with Gibbs sampling or an infinite mixture model as well? After all, the original patch was about integrating PLSI, which is based on EM. I wonder whether one should split this thread into at least four threads:

          • EM implementation
          • Gibbs sampling implementation
          • Dirichlet process implementation
          • PLSI based on EM

          What do you think?

          Isabel Drost-Fromm added a comment -

          EM seems to be an algorithm too general to be implemented only with the use case of PLSI in mind. In addition, PLSI can be addressed with other approaches as well. So, this issue is split into 4 parts. Each of the sub-tasks will be linked back to this issue so work can go on more focused on the individual features without losing track of where the discussion came from.

          Sean Owen added a comment -

          (Now that 0.2 is going out, turning to a bit of JIRA cleanup)

          No activity in 18 months – close as obsolete?

          Ankur added a comment -

          I think we can close this as obsolete, as there are other related JIRAs with a more specific and defined focus.

          Ted Dunning added a comment -

          Yes. LDA supersedes most of this.


            People

            • Assignee: Unassigned
            • Reporter: Ankur
            • Votes: 0
            • Watchers: 0
