[MAHOUT-980] Patch to make PFPGrowth run on Amazon MapReduce (also shows possible pattern to make other algorithms work in Amazon MapReduce) - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.5, 0.6, 0.7
Fix Version/s: 0.7
Component/s: None
Labels:
- hadoop
- patch
Environment: Amazon MapReduce

Description

The patch at http://www.cs.brown.edu/~matteo/PFPGrowth.java.diff (against trunk as of Wed Feb 22 00:07:35 EST 2012, revision 1292127) makes it possible to run PFPGrowth on Elastic MapReduce.

The problem was in how the fList stored in the DistributedCache was accessed. According to the Hadoop API documentation, DistributedCache.getCacheFiles(conf) should be reserved for internal use. The suggested way to access files in the DistributedCache is to call DistributedCache.getLocalCacheFiles(conf) and then read them through a LocalFileSystem, which is what the patch does. There is a fallback for the case where PFPGrowth is run with "-method mapreduce" but locally (e.g. when HADOOP_HOME is not set or MAHOUT_LOCAL is set): in that case we use DistributedCache.getCacheFiles(), as the unpatched version does.
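The lookup order described above can be sketched roughly as follows. This is not the patch itself, only a stand-alone model of the decision it makes: the two arrays stand in for what DistributedCache.getLocalCacheFiles(conf) and DistributedCache.getCacheFiles(conf) would return, so that the sketch runs without Hadoop on the classpath. FListLocator, Location and locate are hypothetical names, and taking the first cache entry as the fList is an assumption.

```java
import java.net.URI;

// Stand-alone model of the lookup order (hypothetical helper, not Mahout code).
// In a real job, localCacheFiles would come from
// DistributedCache.getLocalCacheFiles(conf) and cacheFiles from
// DistributedCache.getCacheFiles(conf); here they are plain parameters.
final class FListLocator {

  /** Where the fList should be read from. */
  static final class Location {
    final boolean local; // true: read through the local file system
    final String path;   // path of the cached file

    Location(boolean local, String path) {
      this.local = local;
      this.path = path;
    }
  }

  /**
   * Prefers the localized copy; falls back to the cache URI when no localized
   * copy exists (the local-run case). Assumes the fList is the first entry.
   */
  static Location locate(String[] localCacheFiles, URI[] cacheFiles) {
    if (localCacheFiles != null && localCacheFiles.length > 0) {
      return new Location(true, localCacheFiles[0]);
    }
    if (cacheFiles != null && cacheFiles.length > 0) {
      return new Location(false, cacheFiles[0].getPath());
    }
    throw new IllegalStateException("fList not found in the distributed cache");
  }

  public static void main(String[] args) {
    // On a cluster: a localized copy is available, so it is used.
    Location onCluster = locate(
        new String[] {"/mnt/cache/fList"},
        new URI[] {URI.create("hdfs://namenode/tmp/fList")});
    // Local run: no localized copy, so the cache URI is used instead.
    Location localRun = locate(null, new URI[] {URI.create("file:///tmp/fList")});

    System.out.println(onCluster.local + " " + onCluster.path);
    System.out.println(localRun.local + " " + localRun.path);
  }
}
```

In an actual mapper or reducer, the local branch would open the returned path through FileSystem.getLocal(conf) and the fallback branch through the job's default file system; both details are left out here to keep the sketch free of Hadoop dependencies.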

A quick grep in the source tree shows that there are other places where DistributedCache.getCacheFiles(conf) is used. It may be worth checking whether the corresponding algorithms can be made to work in Amazon MapReduce by fixing them in a similar fashion.

The patch was also tested outside Amazon MapReduce and does not change any other functionality.

Attachments

PFPGrowth.java.diff
29/Feb/12 22:19
2 kB
Matteo Riondato

People

Assignee: Tom Pierce
Reporter: Matteo Riondato
Votes: 0
Watchers: 0

Dates

Created: 22/Feb/12 05:18
Updated: 31/Mar/15 22:49
Resolved: 12/Mar/12 19:21
