Hama / HAMA-420

Generate random data for Pagerank example

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: examples
    • Labels:
      None

      Description

      As stated in a comment on the Whirr JIRA:
      https://issues.apache.org/jira/browse/WHIRR-355

      We should generate a big file (1-5 GB?) for the PageRank example. We wanted to add this as part of the contrib, but we skipped/lost it somehow.
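
A minimal sketch of how such a file could be generated synthetically instead of crawled. Everything in it is an assumption made for illustration: the class `RandomGraphSketch`, its parameters, and the tab-separated "vertex, then out-neighbours" line layout are hypothetical and are not taken from Hama or from the crawler code linked in this issue.

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

/** Hypothetical helper, not part of Hama: writes a random directed graph as text. */
class RandomGraphSketch {

    /**
     * Writes one line per vertex: the vertex id followed by its out-neighbours,
     * separated by tabs. This line layout is an assumption, not Hama's format.
     */
    static void write(Writer out, int numVertices, int maxOutDegree, long seed) {
        if (numVertices < 2 || maxOutDegree < 0) {
            throw new IllegalArgumentException("need >= 2 vertices and a non-negative degree");
        }
        Random random = new Random(seed); // fixed seed makes the output reproducible
        PrintWriter writer = new PrintWriter(out);
        // A vertex can link to at most numVertices - 1 distinct other vertices.
        int cap = Math.min(maxOutDegree, numVertices - 1);
        for (int v = 0; v < numVertices; v++) {
            int degree = random.nextInt(cap + 1); // 0..cap outgoing links
            Set<Integer> targets = new LinkedHashSet<>();
            while (targets.size() < degree) {
                int t = random.nextInt(numVertices);
                if (t != v) {
                    targets.add(t); // the set silently drops duplicate targets
                }
            }
            StringBuilder line = new StringBuilder().append(v);
            for (int t : targets) {
                line.append('\t').append(t);
            }
            writer.print(line.append('\n').toString());
        }
        writer.flush();
    }
}
```

Passing a `java.io.FileWriter` and a large `numVertices` would produce a file of the desired size; uniformly random links do not resemble a real web graph, so crawled data would still be more realistic.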

      I started crawling several pages, starting from Google News. But then my free Amazon EC2 quota expired and I had to stop the crawl.

      > We need some cloud to crawl
      > We need a place to make the data available

      The stuff we need is already coded here:
      http://code.google.com/p/hama-shortest-paths/source/browse/#svn%2Ftrunk%2Fhama-gsoc%2Fsrc%2Fde%2Fjungblut%2Fcrawl

      Afterwards, a MapReduce processing job in the subpackage "processing" has to be run on the output of the crawler. This job ensures that the adjacency matrix is valid.
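
The issue does not spell out what "valid" means here. The sketch below assumes one plausible reading: duplicate links are dropped, and every page that appears as a link target also gets its own row, even if it has no outgoing links. It is plain in-memory Java rather than a MapReduce job, and the class `AdjacencyCleanupSketch` is hypothetical; it is not the code from the "processing" subpackage.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the cleanup step; not the actual job. */
class AdjacencyCleanupSketch {

    /**
     * Builds an adjacency list from (source, target) pairs.
     * Assumed meaning of "valid": no duplicate edges, and every vertex
     * that is linked to has a row of its own (possibly empty).
     */
    static Map<String, Set<String>> normalize(List<String[]> edges) {
        Map<String, Set<String>> adjacency = new TreeMap<>();
        for (String[] edge : edges) {
            String source = edge[0];
            String target = edge[1];
            // Using a set per row removes repeated links between the same pages.
            adjacency.computeIfAbsent(source, k -> new LinkedHashSet<>()).add(target);
            // Make sure the target has a row even if it was never crawled.
            adjacency.computeIfAbsent(target, k -> new LinkedHashSet<>());
        }
        return adjacency;
    }
}
```

In a MapReduce setting the same idea would typically be expressed by keying on the vertex and merging neighbour sets per key, but that mapping is also an assumption here.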

              People

              • Assignee:
                thomas.jungblut Thomas Jungblut
                Reporter:
                thomas.jungblut Thomas Jungblut