Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Not A Problem
Description
See some background discussion here: http://stackoverflow.com/questions/34690682/spark-mlib-fpgrowth-job-fails-with-memory-error/
The FPGrowth model's run() method seems to do the following (a usage sketch follows this list):
- Count the transactions (to derive minCount from the minimum support)
- Generate frequent items (genFreqItems())
- Generate frequent itemsets (genFreqItemsets())
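For context, a minimal sketch of how such a model is typically trained with the MLlib API; the input path, the space-separated item format, and the parameter values here are hypothetical, not taken from this report:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// sc is an existing SparkContext; each record becomes one transaction,
// i.e. an array of unique items.
val transactions: RDD[Array[String]] =
  sc.textFile("data/transactions.txt").map(_.split(" "))

val model = new FPGrowth()
  .setMinSupport(0.01)   // an item must appear in at least 1% of transactions
  .setNumPartitions(10)  // partitions used by genFreqItems()/genFreqItemsets()
  .run(transactions)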
The model is trained on the outcome of those three steps. When generating frequent items, the code does the following:
data.flatMap { t =>
  val uniq = t.toSet
  if (t.size != uniq.size) {
    throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
  }
  t
}.map(v => (v, 1L))
  .reduceByKey(partitioner, _ + _)
  .filter(_._2 >= minCount)
  .collect()      // materializes every frequent item on the driver
  .sortBy(-_._2)
  .map(_._1)
The collect() call in the snippet above is causing my executors to blow past any amount of memory I can give them. Is there a way to write genFreqItems() and genFreqItemsets() so they won't try to collect all frequent items in memory?
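For illustration, a hedged diagnostic sketch of the same pipeline stopped short of collect(); it reuses the data and minCount values from the snippet above (not defined here) and reports only how many frequent items collect() would pull back:

// Same aggregation as genFreqItems(), but count() returns a single Long
// to the driver instead of the full item array. Note that toSet silently
// deduplicates here rather than throwing on duplicate items.
val numFreqItems = data.flatMap(_.toSet)
  .map(v => (v, 1L))
  .reduceByKey(_ + _)
  .filter(_._2 >= minCount)
  .count()
println(s"collect() would pull $numFreqItems frequent items to the driver")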