SPARK-8555: Online Variational Inference for the Hierarchical Dirichlet Process

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

      Description

      This task is created to explore the online HDP algorithm described in
      http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.

      Major advantages of the algorithm: a single pass over the corpus, streaming friendliness, and automatic inference of K (the number of topics).

      The current scope is to support online HDP for topic modeling, i.e. probably as an optimizer for LDA.
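
      As a rough illustration of the "optimizer for LDA" idea, here is a minimal Scala sketch. Only the existing RDD-based LDA API is real; OnlineHDPOptimizer (and its setMaxTopics parameter) is hypothetical and is shown purely to mark where an HDP optimizer would plug in.

      import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD

      // corpus: (document id, term-count vector) pairs, as expected by LDA.run
      def fitTopics(corpus: RDD[(Long, Vector)]) = {
        // Today: online variational LDA, where the topic count K must be fixed up front.
        val model = new LDA()
          .setK(20)
          .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
          .run(corpus)

        // Proposed (hypothetical): an HDP-based optimizer that infers the number of
        // topics from the data in a single streaming-friendly pass, e.g.
        //   new LDA().setOptimizer(new OnlineHDPOptimizer().setMaxTopics(150)).run(corpus)
        model
      }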


          Activity

          yuhao yang added a comment -

          A basic implementation is available at https://github.com/hhbyyh/HDP; it still needs a lot of improvement and evaluation of performance and scalability.
          Tu Dinh Nguyen added a comment -

          Hi,

          I'm Tu from Deakin University. Our team is currently working intensively on developing this model on Spark.
          We would also like to integrate it into MLlib. Could anyone please tell me what the workflow looks like?

          Your help is much appreciated!
          Sean Owen added a comment -

          I doubt this would be merged into Spark; at least, create the package and list it at spark-packages.org first.
          Tu Dinh Nguyen added a comment -

          Hi Sean,

          Thank you for your reply! Would you mind if I ask why Spark is not interested in HDP?
          Sean Owen added a comment -

          Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines. Generally speaking, there are so many algorithms to implement, most aren't that useful or widely used, and so few really belong in MLlib itself. I'm not commenting on HDP here, though I don't think it's that commonly used. The idea is that it should prove itself out externally.
          Tu Dinh Nguyen added a comment -

          Oh, I see. Thank you for pointing this out!
          Sean Owen added a comment -

          For now, I think we should assume this should be implemented outside Spark first and provided as a package, given the lack of traction.
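
          Following on from "provided as a package", a minimal build.sbt sketch for such an external package might look like the following; the coordinates and versions are placeholders, not part of this issue.

          // build.sbt for a hypothetical external package depending on MLlib.
          name := "spark-online-hdp"
          organization := "com.example"
          scalaVersion := "2.11.8"

          // Spark is "provided" so the package runs against the cluster's own Spark.
          libraryDependencies ++= Seq(
            "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
            "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
          )

          Once published and listed on spark-packages.org, users could pull it in with the standard --packages flag of spark-shell or spark-submit.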

            People

            • Assignee: Unassigned
            • Reporter: yuhao yang
            • Votes: 2
            • Watchers: 7
