Details

Type: New Feature

Status: Closed

Priority: Major

Resolution: Duplicate

Affects Version/s: None

Fix Version/s: None

Component/s: None

Labels:None
Description
Create a simple prototype implementing Expectation Maximization  EM that demonstrates the algorithm functionality given a set of (user, clickurl) data.
The prototype should be functionally complete and should serve as a basis for the MapReduce version of the EM algorithm.

 dpcluster.r
 5 kB
 Ted Dunning
Issue Links
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Hi Ankur,
I tried applying the patch, but I think it has some errors in it. It appears the engine has been added twice????
Can you generate a new patch?
Oops! Looks like my Subversive Eclipse plugin did something whacky while generating the pacth. Sorry for that. Please find the recitifed patch file. Hope this goes through fine.
Hi Grant,
Were you able to get the patch working ? Are there any improvements/changes required ?
Hello Ankur,
I checked out the patch yesterday. I was able to successfully apply the patch to the source in svn, so it seems to have arrived fine. I will take a closer look at the code this weekend.
As far as I know EM from clustering tasks it should be possible to port the algorithm to a Hadoop setting in a similar way as the kmeans implementation, right?
Hi Isabel,
The algorithm sure can be ported to a Mapreduce setting on Hadoop. Infact the algorithm has already been mapreduced as mentioned in the Google new personalization paper (Please see the Javadoc for details).
I wrote the nondistributed version of the algorithm to help myself understand, visualize and see the EM algorithm in action starting with a very small dataset. The iterative logic and small dataset particularly helps in seeing how probability values of user and items belonging to a cluster converge for users sharing large number of common items.
I also have a fair idea of how to Mapreduce it. Once the prototype is accepted suggesting features/changes that would be desirable in the mapreduce implementation, It shouldn't take me long to contribute the distributed version.
Your plan of first trying to understand the nondistributed version and then mapreducing the algorithm sounds great Some comments from my point of view:
Maybe you might want to chose more verbose variable names than u, s and z and provide the mapping to the names used in the paper in a comment. Should make it easier for the reader of your code to distinguish users, stories and clusters (z).
I think you might want to inline() the initialize method. For me personally this makes it easier to follow what is done in the constructors. As for the default constructor, you could simply delegate initialization to PLSI_engine(u, s, z) by giving the default values for initialization.
Concerning the method calculate P_z_u_s  how many cluster numbers do you expect? It seems like this computation could become numerically unstable in case of very large numbers of clusters.
It would be nice if you could provide some unit tests to prove that your code is working correctly.
I know EM as a rather general principle  your implementation seems rather focussed on the setup of the google news clustering solution. I was wondering, whether it would be possible to generalize the implementation a little but still support the new personalization use case? Maybe others would like to reuse a general EM framework but not the exact same formulas that you used. Don't know whether that is possible and whether it can be done in a way that is easy to read....
Thanks for your comment. A few of my replies below:
> Maybe you might ..
Will make these changes in the next patch update.
> ...  how many cluster numbers do you expect ...?
Well typically I would expect a user:cluster ratio of 1000:1. So for 1 million users, 1000 clusters would be created.
In main method, a sample userstory matrix is provided which can be changed to experiment. However if required I can write a small unit test case to produce randomnly generated userstory matrix but am not sure if that will help better.
> I know EM as ...
I like the idea of general EM framework. Will definitely try to change the code so that it reflect EM more generically as suggested.
Hi,
So after scratching my head for past couple of days and
understanding EM from a general perspective I built a mental model
for EM in the context of clustering.
I thought its better to share my understanding for getting inputs before
turning out the code.
So here is a short writeup in my words, please feel free to
fill any gaps/errors found
Expectation Maximization for clustering

Let
z = unobserved data, clusters in our case.
y = observed data, points in our case.
We start by initializing the model paramters p(yz) and p(z) to appropriately
normalized random values.
For eg: let number of points = 4 and number of clusters = 2
then for cluster z1
p(y1z1) + p(y2z1) + p(y3z1) + p(y4z1) = 1
p(z1) + p(z2) = 1
EStep.

Calculate posteriori estimates of probabilities for various
values of the unknown z as follows:
p(y,z) p(yz)*p(z)
p(zy) =  = 
p summation over z
Calculate expected value of log likelihood as follows:
Q = summation over z
{ p(zy) * log(p(y,z))} Q remains unchanged if the calue calculated is = previous Q value.
The algorithm terminates when no improvement is seen for Q for all y
Use Q as the value for p(y,z) for MStep to recompute model parameters
p(yz) and p(z)
MStep

p(y,z) (from EStep)
p(yz) = 
summation over y
p(z) = summation over y { p(y,z) }
(from EStep)
Questions
=========
1. When and how do we recompute the cluster centers ?
2. As per my understanding points and clusters are simply labels with some
conditional probability assigned to them. A distance metric like one
used in Kmeans is nowhere involved, is that correct ?
Adding the comments sent to the list here as well for further reference.
> 
> So here is a short writeup in my words, please feel free to
> fill any gaps/errors found
I will try to do so from my perspective, maybe others can add their views.
> Expectation Maximization for clustering
> 
> Let
> z = unobserved data, clusters in our case.
> y = observed data, points in our case.
>
> p(y1z1) + p(y2z1) + p(y3z1) + p(y4z1) = 1
> p(z1) + p(z2) = 1
Looks correct to me.
> EStep.
> 
> MStep
> 
I could not find an error in neither of the two steps so far.
> Questions
> =========
> 1. When and how do we recompute the cluster centers ?
EM does not work with explicit cluster centers. In kmeans you iterate two
steps: Assigning points to centers and recomputing the centers. In EM you
again iterate two steps: Computing the probabilities for each point belonging
to the clusters (so you do not assign them hard to one cluster, you only say
with probability P it belongs to clusters i to k), in the second step you
recompute the parameters of each cluster  the cluster center is influenced
by each point but only weighted by its probability of belonging to this
cluster.
> 2. As per my understanding points and clusters are simply labels with some
> conditional probability assigned to them. A distance metric like one
> used in Kmeans is nowhere involved, is that correct ?
Yes and no: Technically no, conceptually, your computation for the probability
of assigning a point to a cluster should be based on the point's distance to
the cluster.
I hope I did not cause more confusion than helping you. Maybe others can
correct me or clarify what I left unclear...
Isabel
EM clustering is very seriously prone to overfitting if you give reasonable flexibility to the clusters.
An important adjustment is to put a reasonable prior on the distributions being mixed. This serves as regularization that helps avoid the problem. Kmeans (sort of) avoids the problem by assuming all clusters are symmetric with identical variance.
EM clustering can also be changed very slightly by assigning points to single clusters chosen at random according to the probability of membership. This turns EM clustering into Gibb's sampling. The important property that is changed is that you now can sample over the distribution of possible samplings which can be very important if some parts of your data are well defined and some parts not so well defined.
Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly) nonparametric clustering algorithm. I will attach an R implementation for reference.
R implementation of Dirichlet Process based clustering.
> An important adjustment is to put a reasonable prior on the distributions
> being mixed. This serves as regularization that helps avoid the problem.
> Kmeans (sort of) avoids the problem by assuming all clusters are symmetric
> with identical variance.
I think you could impose the same restriction to EM as well?
> EM clustering can also be changed very slightly by assigning points to
> single clusters chosen at random according to the probability of
> membership. This turns EM clustering into Gibb's sampling.
That is the simplest explanation of Gibb's sampling I have read so far
> Further extension can also be made by assuming an infinite mixture model.
> The implementation is only slightly more difficult and the result is a
> (nearly) nonparametric clustering algorithm. I will attach an R
> implementation for reference.
I think the dirichlet process based clustering comes with the handy property that you can avoid passing the number of parameters into the algorithm, right? To me that seems better for realistic settings where you usually have some data available but you cannot tell how many clusters are there.
Do you think, one can solve the original PLSI problem with Gibb's sampling or an infinite mixture model as well? After all the original patch was about integrating PLSI that is based on EM. I wonder whether one should split this thread into at least four threads:
 EM implementation
 Gibb's sampling implementation
 dirichlet process implementation
 PLSI based on EM
What do you think?
EM seems to be an algorithm to general to be implemented only with the use case of PLSI in mind. In addition PLSI can be addressed with other approaches as well. So, this issue is split into 4 parts. Each of the sub tasks will be linked back to this issue so work can go on more focussed on the individual features without loosing track of where the discussion came from.
(Now that 0.2 is going out, turning to a bit of JIRA cleanup)
No activity in 18 months – close as obsolete?
I think we can close this as obsolete as there are other related JIRA's with specific and more defined focus.
Yes. LDA supercedes most of this.
Here is the prototype implementation of of Probabilistic Latent Semantic Indexing (PLSI) that uses Expectation Maximization. Please refer to javadoc comments for explanation.
Feel free to experiment with the code and have fun