[MAHOUT-123] Implement Latent Dirichlet Allocation - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.2
Fix Version/s: 0.2
Component/s: classic
Labels:
None

Description

(For GSoC)

Abstract:

Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
algorithm for automatically and jointly clustering words into "topics"
and documents into mixtures of topics, and it has been successfully
applied to model change in scientific fields over time (Griffiths and
Steyver, 2004; Hall, et al. 2008). In this project, I propose to
implement a distributed variant of Latent Dirichlet Allocation using
MapReduce, and, time permitting, to investigate extensions of LDA and
possibly more efficient algorithms for distributed inference.

Detailed Description:

A topic model is, roughly, a hierarchical Bayesian model that
associates with each document a probability distribution over
"topics", which are in turn distributions over words. For instance, a
topic in a collection of newswire might include words about "sports",
such as "baseball", "home run", "player", and a document about steroid
use in baseball might include "sports", "drugs", and "politics". Note
that the labels "sports", "drugs", and "politics", are post-hoc labels
assigned by a human, and that the algorithm itself only assigns
associate words with probabilities. The task of parameter estimation
in these models is to learn both what these topics are, and which
documents employ them in what proportions.

One of the promises of unsupervised learning algorithms like Latent
Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
massive collections of documents and condense them down into a
collection of easily understandable topics. However, all available
open source implementations of LDA and related topics models are not
distributed, which hampers their utility. This project seeks to
correct this shortcoming.

In the literature, there have been several proposals for paralellzing
LDA. Newman, et al (2007) proposed to create an "approximate" LDA in
which each processors gets its own subset of the documents to run
Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
its very nature, which is not advantageous for repeated runs. Instead,
I propose to follow Nallapati, et al. (2007) and use a variational
approximation that is fast and non-random.

References:

David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.

David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
allocation, The Journal of Machine Learning Research, 3, p.993-1022,
3/1/2003

T. L. Griffiths and M. Steyvers. Finding scientiﬁc topics. Proc Natl
Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.

David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.

Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
variational EM for Latent Dirichlet Allocation: An experimental
evaluation of speed and scalability, ICDM workshop on high performance
data mining, 2007.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. Distributed
Inference for Latent Dirichlet Allocation. NIPS, 2007.

Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
continuous-time model of topical trends. KDD, 2006

Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
large datasets. ICML, 2008.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

lda.patch
16/Jun/09 20:50
27 kB
David Leo Wright Hall
MAHOUT-123.patch
08/Aug/09 03:20
59 kB
David Leo Wright Hall
MAHOUT-123.patch
04/Aug/09 13:53
60 kB
Grant Ingersoll
MAHOUT-123.patch
03/Aug/09 19:30
58 kB
David Leo Wright Hall
MAHOUT-123.patch
02/Aug/09 19:07
57 kB
David Leo Wright Hall
MAHOUT-123.patch
23/Jul/09 00:21
46 kB
David Leo Wright Hall
MAHOUT-123.patch
02/Jul/09 01:04
43 kB
David Leo Wright Hall
MAHOUT-123.patch
29/Jun/09 18:19
33 kB
David Leo Wright Hall
MAHOUT-123.patch
29/Jun/09 18:02
0.1 kB
David Leo Wright Hall
MAHOUT-123.patch
19/Jun/09 08:00
33 kB
David Leo Wright Hall
MAHOUT-123.patch
18/Jun/09 21:13
26 kB
David Leo Wright Hall

Activity

People

Assignee:: Grant Ingersoll

Reporter:: David Leo Wright Hall

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 23/May/09 19:08

Updated:: 31/Jan/24 22:14

Resolved:: 12/Sep/09 20:13

Time Tracking

Estimated:

504h

Remaining:

504h

Logged:

Not Specified