
SPARK-1405: Parallel Latent Dirichlet Allocation (LDA) atop Spark in MLlib

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: MLlib
    • Labels:
    • Target Version/s:

      Description

      Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the machine learning algorithms currently in MLlib, which rely on optimization methods such as gradient descent, LDA uses expectation-based inference such as Gibbs sampling (a minimal sketch of the sampling update appears below).
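      For reference, below is a minimal, self-contained sketch of one sweep of the collapsed Gibbs sampler that this kind of approach is built on. It is purely illustrative of the technique, not the implementation proposed in this PR; all names and parameter values (numTopics, alpha, beta, the count arrays) are assumptions for the example.

{code:scala}
import scala.util.Random

object GibbsLdaSketch {
  // One full sweep of collapsed Gibbs sampling over every token in the corpus.
  // docs(d)        : word ids of document d
  // z(d)(i)        : current topic assignment of token i in document d
  // docTopic(d)(k) : number of tokens in doc d assigned to topic k
  // topicWord(k)(w): number of tokens of word w assigned to topic k
  // topicTotal(k)  : total number of tokens assigned to topic k
  def sweep(
      docs: Array[Array[Int]],
      z: Array[Array[Int]],
      docTopic: Array[Array[Int]],
      topicWord: Array[Array[Int]],
      topicTotal: Array[Int],
      numTopics: Int,
      vocabSize: Int,
      alpha: Double,
      beta: Double,
      rng: Random): Unit = {
    for (d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i)
      val old = z(d)(i)
      // Remove the token's current assignment from all counters.
      docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1

      // Full conditional: p(k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
      val p = Array.tabulate(numTopics) { k =>
        (docTopic(d)(k) + alpha) * (topicWord(k)(w) + beta) / (topicTotal(k) + vocabSize * beta)
      }
      // Draw the new topic from the unnormalized distribution p.
      val u = rng.nextDouble() * p.sum
      var k = 0
      var cum = p(0)
      while (cum < u && k < numTopics - 1) { k += 1; cum += p(k) }

      // Record the new assignment.
      z(d)(i) = k
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }
  }
}
{code}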

      In this PR, I prepare an LDA implementation based on Gibbs sampling, consisting of a wholeTextFiles API (already resolved), a word segmentation step (imported from Lucene), and a Gibbs sampling core; see the pipeline sketch after the design documents below.

      Algorithm survey from Pedro: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
      API design doc from Joseph: https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing
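
      To make the intended usage concrete, here is a rough sketch of the pipeline described above, written against the MLlib LDA API that shipped with the 1.3.0 fix version. The whitespace tokenizer merely stands in for the Lucene segmenter mentioned in this PR, and the corpus path and parameter values are assumptions for the example, not part of the proposal.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

object LdaPipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lda-pipeline-sketch"))

    // 1. Read each file as a single document (the wholeTextFiles API referenced above).
    val rawDocs = sc.wholeTextFiles("hdfs:///corpus/*.txt").values

    // 2. Placeholder tokenization; the PR imports a proper segmenter from Lucene instead.
    val tokenized = rawDocs.map(_.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq)

    // 3. Build a vocabulary and per-document term-count vectors.
    val vocab = tokenized.flatMap(identity).distinct().collect().zipWithIndex.toMap
    val corpus = tokenized.zipWithIndex().map { case (tokens, docId) =>
      val counts = tokens.groupBy(identity).map { case (term, occurrences) =>
        (vocab(term), occurrences.size.toDouble)
      }.toSeq
      (docId, Vectors.sparse(vocab.size, counts))
    }.cache()

    // 4. Fit LDA and inspect the vocabSize x k topics matrix.
    val model = new LDA().setK(20).setMaxIterations(50).run(corpus)
    println(s"Learned topics matrix: ${model.topicsMatrix.numRows} x ${model.topicsMatrix.numCols}")

    sc.stop()
  }
}
{code}

      Note that the model actually shipped in 1.3.0 is trained with an EM-based optimizer; the Gibbs sampling core proposed here is a different estimation strategy behind a similar front end.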


              People

              • Assignee: Joseph K. Bradley (josephkb)
              • Reporter: Xusen Yin (yinxusen)
              • Shepherd: Xiangrui Meng
              • Votes: 6
              • Watchers: 40

                Dates

                • Created:
                • Updated:
                • Resolved:

                  Time Tracking

                  • Estimated: 336h
                  • Remaining: 336h
                  • Logged: Not Specified