Spark / SPARK-5556

Latent Dirichlet Allocation (LDA) using Gibbs sampler

    Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None
    Attachments

    1. spark-summit.pptx (876 kB, Guoqiang Li)
    2. LDA_test.xlsx (210 kB, Guoqiang Li)

        Activity

        ding ding added a comment -

        We have made the Spark package, and it can be found here: http://spark-packages.org/package/intel-analytics/TopicModeling

        jason.dai Jason Dai added a comment -

        Pedro Rodriguez, we'll try to make a Spark package based on our repo; please help take a look at the code and provide your feedback. Please let us know if there is anything we can collaborate on for LDA/topic modeling on Spark.

        pedrorodriguez Pedro Rodriguez added a comment -

        That is awesome. I've been a bit busy moving/starting a PhD at CU Boulder, so sorry for not responding sooner. The good news is that I will very likely be working on LDA-related research.

        I haven't made more progress on the LDA package beyond creating the GitHub repo and Spark package (on the packages website). I will take a look at the repo this week. It would be great if what you have worked on could be made into a package.

        ding ding added a comment -

        The code can be found at https://github.com/intel-analytics/TopicModeling.

        There is an example in the package; you can try Gibbs sampling LDA or online LDA by setting --optimizer to "gibbs" or "online".

        jason.dai Jason Dai added a comment -

        Sure; we can share our code on GitHub, and then try to make a Spark package.

        josephkb Joseph K. Bradley added a comment -

        Wow that sounds awesome. Would you be able to share your code as a Spark package?

        jason.dai Jason Dai added a comment -

        Pedro Rodriguez I wonder if you have made any progress on the LDA package.

        We have actually built a package of topic modeling algorithms for our use cases, which contains Gibbs Sampling LDA (adapted to the MLlib LDA interface based on the PRs/code from Pedro Rodriguez and Guoqiang Li, including the AliasLDA, SparseLDA, LightLDA, and FastLDA algorithms), as well as Hierarchical LDA (implemented by yuhao yang from our team). We can also share that package for people to try different algorithms.

        josephkb Joseph K. Bradley added a comment -

        Right, I don't think there has been much change, though Guoqiang Li could perhaps say more. I think starting as a Spark package would be great; as you say, it would allow testing of different variants. A reasonable timeline might be to add a Spark package around the 1.5 release and port the best version to Spark for the 1.5 release. Please ping if you'd like feedback. Thank you!

        pedrorodriguez Pedro Rodriguez added a comment -

        I am still interested, but was unsure of the status of other implementations. Given that not much is new, perhaps I should go ahead with it?

        Last week I was also considering the possibility of making a Spark package for LDA. The aims would be threefold: have more algorithms (I have been contacted by a couple of researchers basing their work on the Gibbs LDA I worked on, plus I will likely be using it in my own PhD starting this fall), provide a good place for relatively new/unproven variants, and then pull the best into Spark. I have been pretty busy so haven't gotten around to that, but it has been on my mind.

        When I get time this week, I will take a look at the current source and what I have to see how much work it would take to get to something that could make a pull request. Although FastLDA/LightLDA might be algorithmically better, I think that what I have would be a good starting place at least.

        josephkb Joseph K. Bradley added a comment -

        Pedro Rodriguez I'm going to remove the target version of this JIRA for now, but please do let me know if you're interested in picking this up again. We should definitely add Gibbs sampling in eventually.

        gq Guoqiang Li added a comment -

        FastLDA: Gibbs sampling; the computational complexity is O(n_dk), where n_dk is the number of unique topics in document d. I recommend it for short texts.
        LightLDA: Metropolis-Hastings sampling; the computational complexity is O(1) (it depends on the partition strategy and takes up more memory).
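        The source of that complexity difference can be sketched with a toy example (Python here for compactness; the actual implementations under discussion are Scala, and all names below are illustrative, not taken from the linked code). A naive collapsed-Gibbs draw weighs all K topics, while a SparseLDA/FastLDA-style draw splits the distribution into a smoothing bucket and a document bucket, so the inner loop only visits the n_dk topics actually present in the document:

        ```python
        import random

        ALPHA = 0.1  # symmetric document-topic prior (illustrative value)

        def sample_dense(doc_topic_counts, K, rng):
            """Naive draw: O(K) work per token, touching every topic."""
            weights = [doc_topic_counts.get(k, 0) + ALPHA for k in range(K)]
            r = rng.uniform(0, sum(weights))
            for k, w in enumerate(weights):
                r -= w
                if r <= 0:
                    return k
            return K - 1

        def sample_sparse(doc_topic_counts, K, rng):
            """SparseLDA/FastLDA-style split: the document bucket only
            iterates over the n_dk topics with nonzero count."""
            doc_mass = sum(doc_topic_counts.values())
            smoothing_mass = ALPHA * K
            r = rng.uniform(0, doc_mass + smoothing_mass)
            if r < doc_mass:
                for k, n in doc_topic_counts.items():  # O(n_dk), not O(K)
                    r -= n
                    if r <= 0:
                        return k
            return rng.randrange(K)  # smoothing bucket: uniform over all topics
        ```

        This omits the word-topic term of the real samplers; it is only meant to show why short documents (small n_dk) make the sparse variant cheap.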

        pedrorodriguez Pedro Rodriguez added a comment -

        What are your thoughts on implementation? It looks like LightLDA converges faster and takes more memory, but FastLDA is slightly faster. Could you give a good summary comparing the different algorithms, Guoqiang Li? I went through the data and plots, but some interpretation would be great.

        How should we move forward on choosing an implementation? It makes more sense to decide on something, then work on merging that choice rather than preparing multiple choices. On my Gibbs implementation, I am working on the assumption that algorithmically it is the same as Guoqiang Li's and should perform comparably.

        gq Guoqiang Li added a comment -

        spark-summit.pptx introduces the relevant algorithms.

        gq Guoqiang Li added a comment -

        LDA_Gibbs combines the advantages of the AliasLDA, FastLDA, and SparseLDA algorithms. The corresponding code is https://github.com/witgo/spark/tree/lda_Gibbs or https://github.com/witgo/zen/blob/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553.

        Yes, LightLDA converges faster, but it takes up more memory.

        gq Guoqiang Li added a comment -

        I put the latest LDA code in Zen.
        The test results are here (72 cores, 216 GB RAM, 6 servers, Gigabit Ethernet).

        pedrorodriguez Pedro Rodriguez added a comment -

        Guoqiang Li, is the LDAGibbs line what I implemented, or something else? In any case, the optimization on sampling shouldn't change the results, so it looks like LightLDA converges to a better perplexity.

        Do you have any performance graphs?

        josephkb Joseph K. Bradley added a comment -

        That plan sounds good. I haven't yet been able to look into LightLDA, but it would be good to understand if it's (a) a modification which could be added to Gibbs later on or (b) an algorithm which belongs as a separate algorithm.

        pedrorodriguez Pedro Rodriguez added a comment -

        I will start working on it again then. It would be great for that research project to result in Gibbs being added. The refactoring ended up roadblocking that quite a bit.

        I think Guoqiang Li was working on something called LightLDA. I don't know the specifics of the algorithm, but the sampler theoretically scales O(1) with topics. My implementation has something that, in the testing I did, looks like it is O(1) or very near it in practice.

        To get Gibbs merged in (or as a candidate implementation), how does this look:
        1. Refactor code to fit the PR that you just merged
        2. Use the testing harness you used for the EM LDA to test with the same conditions. This should be fairly easy since you already did all the work to get things pipelining correctly.
        3. If it scales well, then merge or consider other applications
        4. Code review somewhere in there.

        josephkb Joseph K. Bradley added a comment -

        Great! I'm not aware of blockers. As far as other active implementations, the only ones I know of are those referenced by Guoqiang Li above. Please do ping him on your work and see if there are ideas which can be merged. We can help with the coordination and discussions as well. Thanks!

        pedrorodriguez Pedro Rodriguez added a comment -

        With the refactoring done, I can get to working on getting the core code running on that interface.

        Does it seem likely that, if that is completed, Gibbs will get merged for 1.5? Are there any foreseeable blockers or potential different implementations that are being considered?

        josephkb Joseph K. Bradley added a comment -

        I updated the target version for this. I hope we can get it in for the next release!

        pedrorodriguez Pedro Rodriguez added a comment -

        Based on initial testing, I recall FastLDA in practice being O(1); I should be able to confirm that with a larger-scale test soon. LightLDA is definitely worth looking into, I think; at this point, though, my focus is on getting the FastLDA Gibbs to a mergeable state (tests pass, refactoring/API for LDA is good, and it performs at scale as well as or better than EM).

        gq Guoqiang Li added a comment - edited

        This branch's computational complexity is O(n_dk), where n_dk is the number of unique topics in document d.

        pedrorodriguez Pedro Rodriguez added a comment -

        See PR for info; TL;DR: it contains refactoring for multiple LDA algorithms, including how EM would be refactored. It will in the near future contain the Gibbs implementation I have been working on.

        apachespark Apache Spark added a comment -

        User 'EntilZha' has created a pull request for this issue:
        https://github.com/apache/spark/pull/4807

        pedrorodriguez Pedro Rodriguez added a comment -

        I will read that paper; it seems interesting. Probably worth discussing at some point: what is the philosophy behind supporting different algorithms? It seems like there are a good number (at least 2 Gibbs, 1 EM right now). On the same line of thought, perhaps it would be better to open two pull requests, one which refactors the current LDA to allow multiple algorithms, and a second for the Gibbs itself? Thoughts?

        gq Guoqiang Li added a comment -

        LightLDA's computational complexity is O(1).
        The paper: http://arxiv.org/abs/1412.1576
        The code (work in progress): https://github.com/witgo/spark/tree/LightLDA

        pedrorodriguez Pedro Rodriguez added a comment -

        Posting here as a status update. I will be working on and opening a pull request for adding a collapsed Gibbs sampling version which uses FastLDA for super linear scaling with number of topics. Below is the design document (same as from the original LDA JIRA issue), along with the repository/branch I am working on.
        https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing

        https://github.com/EntilZha/spark/tree/LDA-Refactor

        Tasks

        • Rebase from the merged implementation, refactor appropriately
        • Merge/implement the required inheritance/trait/abstract classes to support two implementations (EM and Gibbs) using only the entry points exposed in the EM version, plus an optional argument to select between EM/Gibbs.
        • Do performance tests comparable to those run for EM LDA.

        Some details for inheritance/trait/abstract:
        The general idea would be to create an API which LDA implementations must satisfy using a trait/abstract class. All implementation details would be encapsulated within a state object satisfying the trait/abstract class. LDA would be responsible for creating an EM or Gibbs state object based on a user argument switch/flag. Linked below is a sample implementation based on an earlier version of the merged EM code (it needs to be updated to reflect the changes since then, but it should show the idea well enough):
        https://github.com/EntilZha/spark/blob/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling/LDA.scala#L216-L242
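        The shape of that design can be sketched as follows (Python pseudocode for compactness; the actual proposal is a Scala trait, and every name here is illustrative rather than taken from the linked branch):

        ```python
        from abc import ABC, abstractmethod

        class LDAOptimizerState(ABC):
            """Interface every LDA implementation must satisfy; all
            algorithm-specific details live behind it."""

            @abstractmethod
            def next_iteration(self) -> None:
                """Run one inference step (an EM update or a Gibbs sweep)."""

            @abstractmethod
            def topics_matrix(self):
                """Return the current topic-word distributions."""

        class EMState(LDAOptimizerState):
            def __init__(self, corpus, k):
                self.corpus, self.k = corpus, k
            def next_iteration(self) -> None:
                pass  # EM update over the document-term graph would go here
            def topics_matrix(self):
                return None  # placeholder

        class GibbsState(LDAOptimizerState):
            def __init__(self, corpus, k):
                self.corpus, self.k = corpus, k
            def next_iteration(self) -> None:
                pass  # one collapsed Gibbs sweep over token assignments
            def topics_matrix(self):
                return None  # placeholder

        class LDA:
            """Entry point: exposes only the EM-style API plus one optional
            flag selecting the optimizer, as proposed above."""
            def __init__(self, k: int = 10, optimizer: str = "em"):
                self.k, self.optimizer = k, optimizer

            def run(self, corpus, iterations: int = 10):
                states = {"em": EMState, "gibbs": GibbsState}
                state = states[self.optimizer](corpus, self.k)
                for _ in range(iterations):
                    state.next_iteration()
                return state
        ```

        The point of the design is that adding a third algorithm later would mean adding one more state class, without touching the public LDA entry points.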

        Timeline: I have been busier than expected, but rebase/refactoring should be done in the next few days, then I will open a PR to get feedback while running performance tests.

        josephkb Joseph K. Bradley added a comment -

        I believe Xiangrui Meng and Guoqiang Li have confirmed that's the plan. It would be great to get this as another algorithm option; I suspect it will perform better than EM.

        One thought: There are several possibilities for Gibbs sampling algorithms:

        • Collapsed Gibbs sampling (most common, but distributed implementations are all non-ergodic) (This is what Guoqiang Li's PR uses, I believe.)
        • Non-collapsed Gibbs sampling (not common, but easy to make an ergodic distributed implementation)
        • Modified versions of the LDA model designed for distributed computation, such as HD-LDA in Newman et al. “Distributed Algorithms for Topic Models.” JMLR, 2009.
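        As a reference point for the first option, the single-machine version of collapsed Gibbs sampling is small; a toy Python sketch (illustrative names and priors, not taken from any of the PRs; distributing it is where the ergodicity concern above comes in):

        ```python
        import random

        def collapsed_gibbs_lda(docs, K, V, iters=50, alpha=0.1, beta=0.01, seed=0):
            """docs: list of lists of word ids in [0, V). Returns per-document
            topic counts after resampling each token's topic from its full
            conditional, `iters` sweeps over the corpus."""
            rng = random.Random(seed)
            z = [[rng.randrange(K) for _ in doc] for doc in docs]  # topic assignments
            ndk = [[0] * K for _ in docs]                          # doc-topic counts
            nkw = [[0] * V for _ in range(K)]                      # topic-word counts
            nk = [0] * K                                           # topic totals
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]
                    ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
            for _ in range(iters):
                for d, doc in enumerate(docs):
                    for i, w in enumerate(doc):
                        k = z[d][i]                                # remove the token
                        ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                        # full conditional p(z = t | rest), up to a constant
                        weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                                   / (nk[t] + V * beta) for t in range(K)]
                        r = rng.uniform(0, sum(weights))
                        for t, wt in enumerate(weights):
                            r -= wt
                            if r <= 0:
                                k = t
                                break
                        z[d][i] = k                                # add it back
                        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
            return ndk
        ```

        The O(K) inner loop here is exactly what FastLDA/SparseLDA/AliasLDA-style samplers, discussed elsewhere in this thread, optimize away.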
        srowen Sean Owen added a comment -

        To clarify, since I had to double-check too, the LDA implemented in SPARK-1405 uses EM, so this tracks adding an implementation based on Gibbs sampling using that same code base?


          People

          • Assignee: pedrorodriguez Pedro Rodriguez
          • Reporter: gq Guoqiang Li
          • Shepherd: Joseph K. Bradley
          • Votes: 4
          • Watchers: 13
