Mahout / MAHOUT-399

LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.3, 0.4, 0.5
    • Fix Version/s: 0.7
    • Component/s: Classification
    • Labels:
    • Environment: Mac OS X 10.6.2, Hadoop 0.20.2, Mahout 0.3.

      Description

      Hello,

      Apologies if I have not labeled this correctly.

      I have run a toy problem on Mahout 0.3 (locally) for LDA that I used to test Blei's c version of LDA that he posts on his site. It has an exact solution that the LDA should converge to. Please see attached PDF that describes the intended output.

      Is LDA working? The following output indicates some sort of collapsing behavior to me.

      T0 T1 T2 T3 T4
      x w x u x
      u u g j n
      l r i m l
      j q h h p
      v p e i q
      e t f g v
      d s d f o
      b c b n k
      y f c l m
      w v u v u
      c d p y t
      k o l r r
      i b j k j
      f e k e f
      g x y s y
      t y w b w
      h i s p s
      o l v x d
      q j t d i
      n k o t b

      The intended output is (again, please see attached):

      D I N S X
      d i n s x
      c h m t y
      e j o r w
      b k l u v
      f g p q a
      a f k p b
      g l q v u
      h m j w t
      y u r o c
      n s d d i
      s e x f f
      r q i i n
      m v w c o
      o w u a h
      q n s h g
      p t c x d
      t x f e l
      x d e j s
      w y g b j
      i r y n r
      u o h y m
      k b t l e
      v c a m k
      j a b g p
      l p v k q

      What tests do you run to make sure the output is correct?

      Thank you,
      Mike.

      1. Overlapping Pyramids Toy Dataset.pdf
        936 kB
        Michael Lazarus
      2. olt.tar
        9.77 MB
        Michael Lazarus
      3. 1000docs_26terms_5topics.jpg
        52 kB
        Jake Mannix
      4. MAHOUT-399.diff
        13 kB
        Jake Mannix

          Activity

          Michael Lazarus added a comment -

          Please see the attached PDF that describes the toy problem and expected output. Also please find attached a sample dataset of 10,000 small documents created according to the method described in the PDF.

          Michael Lazarus added a comment -

          The file olt.tar contains a 10,000-document corpus generated according to the method described in the PDF.

          Jeff Eastman added a comment -

          The LDA unit tests are in org.apache.mahout.clustering.lda and are not extensive. It would be really nice to get the same answer. Can you turn your example into another test case?

          Michael Lazarus added a comment -

          Hi Jeff,

          I would be happy to give it a try. I will take a look at the existing unit tests.

          Thanks,
          Mike.

          David Hall added a comment -

          LDA, like most unsupervised learning problems, is not a convex problem. Given different starting points, or minor (but still correct) implementation differences, you can arrive at different results.

          That's not to say that there isn't a bug here, but just because we don't converge to the same value doesn't mean that there is a bug.

          Michael Lazarus added a comment -

          Hi David,

          Thank you for responding. I agree that you can arrive at different results with slightly different implementations. And also sometimes LDA gets stuck in a local minimum. In the pdf, for example, looking at the log-likelihood of the corpus given the model across models with different numbers of topics, you can see that Blei's implementation of LDA gets stuck on a ten topic model.

          This dataset was designed to have just enough structure to demonstrate some of the behavior of the algorithm while at the same time defining an underlying model that should be readily discoverable by an implementation.

          I have found it useful in debugging various systems that use LDA.

          Thanks,
          Mike.

          Sean Owen added a comment -

          What's the verdict here? Is the implementation probably OK, or does this need more study?

          Ted Dunning added a comment -

          I think that this needs more study. I got email from Mike and it does seem that there is a reasonable likelihood that there is still a serious problem. The problem is that I respect both Mike's and David's opinions pretty highly, and they seem to draw incompatible conclusions. That still leaves me with the feeling that a problem is reasonably likely (> 10% chance, at least). Mike's email follows:

          Hi Ted,

          I have implemented a parallel version of LDA in C# that separates the processing, but not the data. It is based on collapsed Gibbs sampling. And it converges to the correct solution on the overlapping pyramids dataset.

          The last e-mail from David Hall indicated to me that he did not think the result for the dataset was conclusive evidence there is a bug. I disagree. The statistics of the dataset are overwhelming. And when you look at the computed likelihood of the corpus it typically reaches its maximum at 5 topics.

          It took me a while to get hadoop up and running on ec2 and then to get the Mahout examples running. After David's e-mail indicating he did not think the result was conclusive, I decided to implement something for the environment I am working in.

          I did not see much in the way of documentation for the Mahout implementation, but my guess at the algorithm was that it was using a variational method. Since I have not implemented that approach, I do not have an idea where the bug is yet.

          Blei's C version implementation does converge as well. On rare occasion it does not converge, but rerunning it will almost always yield convergence.

          I have run David Hall's implementation for different numbers of topics and repeatedly for each number of topics. It has never converged.

          I did send a document along describing the dataset and providing a sample so that someone else could corroborate the result. I may have made a procedural error in running LDA even though I think I ran everything correctly.

          I would be interested in looking at the variational approach and then trying to debug the current algorithm, but I do not have time to do that at the moment. Another option would be to convince David Hall to take a second look.

          I hope that helps a little. I would be happy to talk to anyone in more detail.

          Thanks,
          Mike.

          Sean Owen added a comment -

          Not sure if I can be of help, but seems worthwhile keeping this on the radar for 0.5.

          Robin Anil added a comment -

          David, Mike? Did either of you make any progress on debugging the issue?

          Grant Ingersoll added a comment -

          It would be nice to have this resolved for 0.5. Mike, any way we can get a unit test for it?

          Grant Ingersoll added a comment -

          Ted, I'm trying to set up a test for this; can I assign it to myself?

          Mike, if you are still around: any chance you can put up your input generation code at a minimum? If you can do that, I will incorporate it into a test case, as I would rather not check in 1000 small files.

          I still am not sure what is correct here yet, but I agree with Ted that it merits looking into a bit more and a test case seems the most logical way to do that.

          Ted Dunning added a comment -

          Grant, you have the helm.

          The dataset that Mike provided should be adaptable into a test case. It is a bit large, but should compress pretty well and the documents can probably be shoe-horned into a single file.

          That would leave us with (probably) a 1MB resource file to be loaded by the test. That seems doable.

          Michael Lazarus added a comment -

          I can write a unit test for the dataset. How should I submit it?

          Ted Dunning added a comment -

          As a patch attached to the JIRA.

          If a patch is a pain, you can even just attach a tar of new java files. I
          am happy to help smooth any rough edges.

          Michael Lazarus added a comment -

          Mike, if you are still around: any chance you can put up your input generation
          code at a minimum. If you can do that, I will incorporate into a test case, as
          I would rather not check in 1000 small files.

          I still am not sure what is correct here yet, but I agree with Ted that it
          merits looking into a bit more and a test case seems the most logical way to do
          that.

          ...

          Ok. I just saw this. Maybe what I can do as a first step is take Blei's C
          implementation, or another open-source Java implementation, and write a Java
          test case around that. Then you can plug in the Mahout piece. I can just put
          the data into a static array so no files are necessary.

          Grant Ingersoll added a comment -

          Steps I've run:

          #../../patches/input contains the untarred tarball Mike provided
          ./mahout seqdirectory --input ../../patches/input/ --output ../../patches/output
          ./mahout seq2sparse --input ../../patches/output/ --output ../../patches/vectors -wt tf -seq
          ./mahout lda -i ../../patches/vectors -o ../../patches/lda -k 5 -v 30

          On the last command, I'm getting:

          Exception in thread "main" java.io.FileNotFoundException: File file:/Users/grantingersoll/projects/lucene/mahout/patches/vectors/tf-vectors/data does not exist.
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
          at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
          at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
          at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
          at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
          at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
          at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
          at org.apache.mahout.clustering.lda.LDADriver.runIteration(LDADriver.java:265)
          at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:166)
          at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:144)

          Grant Ingersoll added a comment -

          Mike,

          See https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute.

          If you can put up a test, that would be great. I started on one, but was mostly working on actually running LDA on your data set and am getting the error above. Still investigating this.

          Grant Ingersoll added a comment -

          Figured out the FileNotFoundException: the input actually needs to be a directory below the output of seq2sparse, as in patches/vectors/tf-vectors.

          Grant Ingersoll added a comment -

          Finally got output, but not the same as Mike's.

          I ran:

          ./mahout lda -i ../../patches/vectors/tf-vectors -o ../../patches/lda -k 5 -v 25
          ./mahout ldatopics -i ../../patches/lda/state-25 -d ../../patches/vectors/dictionary.file-0 -dt sequencefile

          I believe my vectors are setup correctly.

          Also, Mike, in reading your PDF (Toy Pyramids), is that a typo at the end of the first column of page one? The wording says 5 topics with 7 words, but the table shows 9 words per topic (if I'm reading correctly).

          Michael Lazarus added a comment -

          Hi Grant,

          Yes, there should be 9 words per topic. But typically, only the first 7 words
          come back in the right order for any given phi vector describing
          Pr[word | topic = z]. For example, if we are looking at topic D then we
          should get:

          Pr[word = d | topic = D] > Pr[word = c | topic = D] > Pr[word = e | topic = D] >
          Pr[word = b | topic = D] > Pr[word = f | topic = D] > Pr[word = a | topic = D] >
          Pr[word = g | topic = D]

          I am working on the unit test which shows proper output with Blei's C
          distribution. You will be able to run that to see the correct output.

          Thanks,
          Mike.
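
          That ordering constraint can be checked mechanically. Below is a minimal sketch (plain Python, not Mahout code; the phi values are invented for illustration) that compares a topic's top-7 words against the expected pyramid order for topic D:

```python
def top_words(phi, n=7):
    """Return the n highest-probability words for one topic."""
    return [w for w, _ in sorted(phi.items(), key=lambda kv: -kv[1])[:n]]

# Invented phi vector for topic D -- NOT real Mahout output, just the
# shape of thing ldatopics reports: Pr[word | topic = D] per word.
phi_D = {'d': 0.30, 'c': 0.20, 'e': 0.15, 'b': 0.10,
         'f': 0.08, 'a': 0.06, 'g': 0.05, 'x': 0.03, 'y': 0.03}

# The first 7 words should come back in the pyramid order d c e b f a g.
assert top_words(phi_D) == ['d', 'c', 'e', 'b', 'f', 'a', 'g']
```

          A real test would build each phi vector from the ldatopics output and run the same check for topics I, N, S, and X.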

          Grant Ingersoll added a comment -

          Mike, any progress on this? We are fast approaching the 0.5 deadline and I am unable to reproduce the results you reported, given the steps I ran above.

          Michael Lazarus added a comment -

          ... will try and finish the unit test today.

          Michael Lazarus added a comment -

          I am going to have to tackle this on the weekend.

          It is very difficult for me to do this work during the week.

          Sean Owen added a comment -

          Since this isn't obviously a blocker, I think it's fine to say take your time, and let's look at this for 0.6. If by magic it's done for 0.5, great.

          Grant Ingersoll added a comment -

          Michael, any luck on the unit tests?

          Michael Lazarus added a comment -

          I am coming to the end of a one-month sprint and have been unable to get to this.
          What I am planning to do is implement a unit test which uses Blei's
          implementation to decode a dataset that was encoded with a plate model that is
          the same as the LDA plate model. That will provide an instructional unit test.
          I have started the unit test and have not finished. Maybe this week.

          Grant Ingersoll added a comment -

          We might think about marking this as "Won't fix" when Jake's LDA changes come in.

          Michael Lazarus added a comment -

          We should mark it as "won't fix". I can add my unit tests to Jake's implementation when it comes in and I have time, if you like. It looks like he is taking a good approach by distributing the collapsed Gibbs sampling and then by optimizing the sampling of the Markov chain, which easily provides a 10x scale-up. That works well.

          Jake Mannix added a comment -

          So after more carefully reading your PDF here, Mike, I think this ticket should get a better response than "won't fix", regardless of what code is in the project. It's a well-stated problem, and a really nice way to look at verifying the code works as advertised. It should be a unit test, and it should pass.

          I'll try it out on my new code, and see if I can a) get it to pass, and b) include it in the codebase. It's a very compact data set, totally easy to see it run as part of our test suite.

          Jake Mannix added a comment -

          Of course, it should be noted that the code I'll be running it against differs in algorithmic detail from both Mike's code (collapsed Gibbs sampling) and David's original implementation here (based on variational Bayes), as it's a parallel version of an approximate collapsed variational Bayes (c.f. the algorithm labeled "CVB0" here: http://www.datalab.uci.edu/papers/uai_2009.pdf)
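
          For reference, the per-token CVB0 update in the linked paper is simple enough to sketch. The following is an illustrative NumPy version (variable names are mine, not the ones in Jake's patch, and this is a single-machine sketch, not the parallel implementation):

```python
import numpy as np

def cvb0_update(gamma, n_wt, n_dt, n_t, w, alpha, eta, V):
    """One CVB0 step for a single token with word id w in one document.

    gamma -- this token's current topic responsibilities, length K
    n_wt  -- expected word-topic counts, shape (V, K)
    n_dt  -- expected doc-topic counts for the token's document, length K
    n_t   -- expected total counts per topic, length K
    The counts include this token's gamma, so it is subtracted out first
    (the "minus current token" part of the CVB0 equations).
    """
    # remove this token's current contribution (in place)
    n_wt[w] -= gamma; n_dt -= gamma; n_t -= gamma
    # gamma_k proportional to (n_wt[w,k]+eta) * (n_dt[k]+alpha) / (n_t[k]+V*eta)
    g = (n_wt[w] + eta) * (n_dt + alpha) / (n_t + V * eta)
    g /= g.sum()
    # add the refreshed contribution back
    n_wt[w] += g; n_dt += g; n_t += g
    return g
```

          Sweeping this update over every token until the responsibilities stop changing is the whole inference loop; the counts are real-valued expectations rather than the integer counts of collapsed Gibbs sampling.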

          Jake Mannix added a comment -

          So now I think I'm ready to start putting some patches up, because I've got a unit test which lets you dynamically (not piles of files) create these kinds of synthetic data sets, and verify convergence on them, giving you pretty pictures like the attached (which was generated after running this against my GitHub branch code).

          Michael Lazarus added a comment -

          That is perfect.  The perplexity is minimal at 5 topics as it should be.  Very nice.

          Jake Mannix added a comment -

          This adds a full end-to-end "unit" test which verifies correctness of the current LDA code, in that the (self-reported) perplexity is lowest, when using this kind of synthetic data set, when the number of topics is equal to the number of generating topics.

          The test is highly parametrizable: choose the number of terms, number of generating topics, number of documents in the test corpus, number of topics per document, and size of each document, as well as a hook to put in the functional form of "decay" in the generating model, depending on how you want the test model to look.

          New test currently passes on 26 terms, 5 topics, and 500 documents with one topic per doc.
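
          A rough stand-alone sketch of that kind of parametrizable generator (the parameter names and the overlapping-band construction here are illustrative guesses, not the actual code in MAHOUT-399.diff):

```python
import random

def synthetic_corpus(n_terms=26, n_topics=5, n_docs=500, doc_len=20,
                     decay=lambda rank: 0.5 ** rank, seed=0):
    """Generate a toy corpus from a known topic model.

    Each topic owns an overlapping band of term ids, with per-term
    probability falling off by decay(rank) within the band. Returns
    (docs, labels): docs is a list of term-id lists, labels the true
    generating topic of each document (one topic per doc).
    """
    rng = random.Random(seed)
    band = n_terms // n_topics
    topics = []
    for k in range(n_topics):
        # band of terms starting at this topic, overlapping the next one
        terms = [(k * band + j) % n_terms for j in range(2 * band)]
        weights = [decay(j) for j in range(len(terms))]
        total = sum(weights)
        topics.append((terms, [w / total for w in weights]))
    docs, labels = [], []
    for _ in range(n_docs):
        k = rng.randrange(n_topics)          # one topic per document
        terms, probs = topics[k]
        docs.append(rng.choices(terms, probs, k=doc_len))
        labels.append(k)
    return docs, labels
```

          Running LDA over such a corpus with the number of topics equal to n_topics should then yield the lowest perplexity, which is the property the new test asserts.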

          Jake Mannix added a comment -

          While it appears that current trunk Mahout LDA correctly converges on this toy problem, I'm reopening this to track the need for this unit test to verify that this is the case.

          Jake Mannix added a comment -

          Ah, not sure what happened, but the current trunk LDA is now failing this test, while the new one is not. Marking the old lda test with @Ignore("MAHOUT-399") to track it for now.

          Grant Ingersoll added a comment -

          Jake, what's the status on this?

          Jake Mannix added a comment -

          Haven't really looked at it. I'd say that the original Mahout LDA (David Hall's version) has corner cases where it doesn't converge properly, even on a clearly defined topic-derived small corpus. This test passes for the new LDA impl (CVB0). We can close this one as "fixed in one impl, won't fix in another" and open another JIRA ticket for 0.7 to remove the old LDA once we verify that users have tried the new one on a variety of data sets and like it better. Right now we're going on the fact that I (and my coworkers) have used it successfully in-house. That's not a lot of verification to go on, but I'd feel comfortable removing the old LDA in 0.7 even if we don't get much test feedback from other people. I'm open to discussion on that.
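
          For readers unfamiliar with CVB0: the per-token update it performs (zeroth-order collapsed variational Bayes, as described by Asuncion et al.) can be sketched as below. This is a standalone hypothetical illustration, not Mahout's actual CVB0 code; the method signature, and the symmetric alpha/eta priors, are assumptions:

```java
public class Cvb0Sketch {

    /**
     * One CVB0 responsibility update for a single token occurrence.
     * gamma      : current topic responsibilities for this token
     * docTopic   : expected topic counts for the token's document
     * wordTopic  : expected topic counts for the token's term
     * topicSum   : expected total counts per topic over the corpus
     * alpha, eta : symmetric Dirichlet priors on the doc-topic and topic-term distributions
     */
    static double[] cvb0Update(double[] gamma, double[] docTopic, double[] wordTopic,
                               double[] topicSum, double alpha, double eta, int numTerms) {
        int numTopics = gamma.length;
        double[] next = new double[numTopics];
        double norm = 0;
        for (int k = 0; k < numTopics; k++) {
            // remove this token's own current responsibility before updating it
            double ndk = docTopic[k] - gamma[k];
            double nwk = wordTopic[k] - gamma[k];
            double nk  = topicSum[k] - gamma[k];
            next[k] = (ndk + alpha) * (nwk + eta) / (nk + numTerms * eta);
            norm += next[k];
        }
        for (int k = 0; k < numTopics; k++) {
            next[k] /= norm;               // renormalize to a distribution over topics
        }
        return next;
    }

    public static void main(String[] args) {
        // Toy numbers: 2 topics, a 26-term vocabulary as in the pyramid example
        double[] gamma     = {0.5, 0.5};
        double[] docTopic  = {3.0, 2.0};
        double[] wordTopic = {4.0, 1.0};
        double[] topicSum  = {10.0, 8.0};
        double[] next = cvb0Update(gamma, docTopic, wordTopic, topicSum, 0.1, 0.1, 26);
        System.out.printf("gamma' = [%.4f, %.4f]%n", next[0], next[1]);
    }
}
```

          Iterating this update over all tokens until the responsibilities stabilize is the whole of CVB0 inference; there is no separate E/M split as in the variational LDA the old implementation used.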

          Grant Ingersoll added a comment -

          You had re-opened this issue, implying you had more work to be done.

          +1 for removing old LDA, as I don't know how much it was used anyway due to the scalability issues.

          Grant Ingersoll added a comment -

          I guess the question is, do you feel comfortable resolving this issue so that we can get on to 0.6? Or should we move it to 0.7?

          Jake Mannix added a comment -

          My feeling is that users deserve to know that this issue exists in some form for the current LDA in Mahout. For that reason, if this were a JIRA ticket at work, it would be left open for 0.7, to be resolved one way or another in the next release, not closed.

          Jake Mannix added a comment -

          Per the comments, pushing to 0.7; still open.

          Ted Dunning added a comment -

          What say we close this as resolved by the new LDA implementations that Jake has worked out?

          This should probably involve the removal of the old implementation.

          Sebastian Schelter added a comment -

          +1

          Ted Dunning added a comment -

          This issue will be moot when MAHOUT-1009 goes in.


            People

            • Assignee: Jake Mannix
            • Reporter: Michael Lazarus
            • Votes: 0
            • Watchers: 4