[SPARK-14371] OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: MLlib
Labels: None

Description

See this line: https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437

The second element in each row of "stats" is a list with one Vector for each document in the mini-batch. Those are collected to the driver in this line:
https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456

We should not collect those to the driver. Rather, we should do the necessary maps and aggregations in a distributed manner. This will involve modifying the Dirichlet expectation implementation. (This JIRA should be done by someone knowledgeable about online LDA and Spark.)
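
To illustrate the idea, here is a minimal sketch in plain Python (no Spark), not the code from the linked pull request. It assumes the per-document vectors are the Dirichlet parameters (gamma) and that the driver only needs the mean of their Dirichlet expectations, E[log theta_k] = psi(gamma_k) - psi(sum(gamma)). Under that assumption, each partition can reduce its documents to one K-sized sum plus a count, and only those partials are combined. The names `partition_stats`, `combine`, and `logphat` are hypothetical, and the nested lists stand in for RDD partitions.

```python
import math
from functools import reduce


def digamma(x):
    """Approximate psi(x) for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_expectation(gamma):
    """E[log theta_k] for theta ~ Dirichlet(gamma), for a single document."""
    total = digamma(sum(gamma))
    return [digamma(g) - total for g in gamma]


def partition_stats(docs_gamma, k):
    """Reduce one partition to (sum of E[log theta] over its docs, doc count)."""
    acc = [0.0] * k
    for gamma in docs_gamma:
        for i, value in enumerate(dirichlet_expectation(gamma)):
            acc[i] += value
    return acc, len(docs_gamma)


def combine(a, b):
    """Merge two partials; associative, so it suits a tree-style aggregation."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


# Hypothetical mini-batch: two "partitions" holding per-document gamma vectors (K = 3).
K = 3
partitions = [
    [[1.0, 2.0, 3.0], [0.5, 0.5, 4.0]],
    [[2.0, 2.0, 2.0]],
]

# Aggregated style: only one K-sized vector and a count per partition are combined.
sum_vec, n_docs = reduce(combine, (partition_stats(p, K) for p in partitions))
logphat = [v / n_docs for v in sum_vec]

# Collect style, for comparison: every per-document vector is gathered first.
all_gammas = [g for p in partitions for g in p]
expectations = [dirichlet_expectation(g) for g in all_gammas]
logphat_collected = [sum(e[i] for e in expectations) / len(all_gammas) for i in range(K)]
```

Both paths produce the same K-sized result, but the aggregated one moves data proportional to the number of partitions times K rather than the number of documents times K. In Spark, the `combine` step would correspond to something like an RDD aggregation; whether the merged fix takes exactly this shape is not stated in this issue.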

Issue Links

blocks: SPARK-22111 OnlineLDAOptimizer should filter out empty documents beforehand (Resolved)

is required by: SPARK-5572 LDA improvement listing (Resolved)

links to: [Github] Pull Request #18924 (akopich)

People

Assignee: Valeriy Avanesov

Reporter: Joseph K. Bradley

Shepherd: Joseph K. Bradley

Votes: 0

Watchers: 6

Dates

Created: 04/Apr/16 18:20

Updated: 19/Oct/17 07:39

Resolved: 19/Oct/17 07:39