[SPARK-1405] parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3.0
Component/s: MLlib
Labels:
- features

Target Version/s: 1.3.0

Description

Latent Dirichlet Allocation (LDA) is a topic model that extracts topics from a text corpus. Unlike the existing machine learning algorithms in MLlib, which rely on optimization algorithms such as gradient descent, LDA uses sampling-based inference algorithms such as Gibbs sampling.

In this PR, I have prepared an LDA implementation based on Gibbs sampling, consisting of a wholeTextFiles API (already resolved), word segmentation (imported from Lucene), and a Gibbs sampling core.
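For readers unfamiliar with the approach, here is a minimal single-machine sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the code from any of the linked pull requests and does not use Spark. The function names (`build_corpus`, `gibbs_lda`), the regex tokenizer standing in for the Lucene segmenter, and the default values of `alpha` and `beta` are all assumptions made for illustration.

```python
import random
import re


def build_corpus(texts):
    """Tokenize raw strings and map each distinct word to an integer id.

    A simple regex split stands in for a real word segmenter (assumption).
    """
    vocab = {}
    docs = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        docs.append([vocab.setdefault(t, len(vocab)) for t in tokens])
    return docs, vocab


def gibbs_lda(docs, num_topics, vocab_size,
              alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the two count tables."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Resample the topic and put the token back into the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


texts = [
    "apple banana apple fruit",
    "goal match team goal",
    "banana fruit apple",
    "team match goal",
]
docs, vocab = build_corpus(texts)
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=len(vocab))
print(doc_topic)
```

Each row of `doc_topic` holds the number of tokens in that document assigned to each topic, so the row sums equal the document lengths. A parallel version would have to partition these count tables across workers and reconcile them between sweeps, which this sketch does not attempt.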

Algorithm survey from Pedro: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
API design doc from Joseph: https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing

Attachments

performance_comparison.png
28/Sep/14 07:51
48 kB
Guoqiang Li

Issue Links

duplicates

SPARK-953 Latent Dirichlet Association (LDA model)

Resolved

is related to

SPARK-5556 Latent Dirichlet Allocation (LDA) using Gibbs sampler

Resolved

relates to

SPARK-2199 Distributed probabilistic latent semantic analysis in MLlib

Resolved

links to

[Github] Pull Request #476 (yinxusen)

[Github] Pull Request #1983 (witgo)

[Github] Pull Request #2388 (witgo)

[Github] Pull Request #4047 (jkbradley)

People

Assignee: Joseph K. Bradley
Reporter: Xusen Yin
Shepherd: Xiangrui Meng
Votes: 6
Watchers: 32

Dates

Created: 03/Apr/14 06:21
Updated: 19/Feb/16 18:10
Resolved: 03/Feb/15 07:58

Time Tracking

Estimated: 336h
Remaining: 336h
Logged: Not Specified