[HAMA-420] Generate random data for Pagerank example - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: examples
Labels:
None

Description

As stated in comment on whirrs jira:
https://issues.apache.org/jira/browse/WHIRR-355

We should generate a big file (1-5gb?) for PageRank example. We wanted to add this as a part of the contrib, but we skipped/lost it somehow.

I started crawling several pages, starting from google news. But then my free Amazon EC2 qouta expired and had to stop the crawl.

> We need some cloud to crawl
> We need a place to make the data available

The stuff we need is already coded here:
http://code.google.com/p/hama-shortest-paths/source/browse/#svn%2Ftrunk%2Fhama-gsoc%2Fsrc%2Fde%2Fjungblut%2Fcrawl

Afterwards a m/r processing job in the subpackage "processing" has to be run on the output of the crawler. This job takes care that the adjacency matrix is valid.

Attachments

Issue Links

is depended upon by

WHIRR-355 [HAMA] Add Hama's PageRank example to Whirr's examples

Resolved

Activity

People

Assignee:: Thomas Jungblut

Reporter:: Thomas Jungblut

Votes:: 1 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 05/Aug/11 11:51

Updated:: 02/May/13 02:29

Resolved:: 20/Aug/11 08:26