I think that this needs more study. I got email from Mike, and it does seem reasonably likely that there is still a serious problem. The difficulty is that I respect both Mike's and David's opinions highly, and they seem to draw incompatible conclusions. That still leaves me with the feeling that a problem is reasonably likely (> 10% chance at least).
I have implemented a parallel version of LDA in C# that distributes the processing, but not the data. It is based on collapsed Gibbs sampling, and it converges to the correct solution on the overlapping pyramids dataset.
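For readers unfamiliar with the approach, the following is a minimal single-threaded sketch of collapsed Gibbs sampling for LDA, written in Python rather than C#. It is not the C# implementation described above. The function name, the hyperparameter defaults (alpha, beta), and the toy corpus are all assumptions made for illustration.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper).

    docs is a list of documents, each a list of integer word ids in
    [0, vocab_size). Returns the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional probability of each topic
                # given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Resample the topic and add the token back.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Toy corpus (made up for illustration): three documents over five word ids.
toy_docs = [[0, 1, 2, 0], [3, 4, 3], [0, 1, 4]]
doc_topic, topic_word = collapsed_gibbs_lda(toy_docs, num_topics=2, vocab_size=5)
```

A parallel variant that separates the processing but not the data would presumably share the count tables across workers; how that is done in the C# version is not stated here.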
The last e-mail from David Hall indicated to me that he did not think the result for the dataset was conclusive evidence that there is a bug. I disagree. The statistics of the dataset are overwhelming, and the computed likelihood of the corpus typically reaches its maximum at 5 topics.
It took me a while to get Hadoop up and running on EC2 and then to get the Mahout examples running. After David's e-mail indicating he did not think the result was conclusive, I decided to implement something for the environment I am working in.
I did not see much in the way of documentation for the Mahout implementation, but my guess is that it uses a variational method. Since I have not implemented that approach, I do not yet have an idea of where the bug is.
Blei's C implementation converges as well. On rare occasions it does not converge, but rerunning it almost always yields convergence.
I have run David Hall's implementation with different numbers of topics, and repeatedly for each number of topics. It has never converged.
I did send a document along describing the dataset and providing a sample so that someone else could corroborate the result. I may have made a procedural error in running LDA even though I think I ran everything correctly.
I would be interested in looking at the variational approach and then trying to debug the current algorithm, but I do not have time to do that at the moment. Another option would be to convince David Hall to take a second look.
I hope that helps a little. I would be happy to talk to anyone in more detail.