Hama / HAMA-420

Generate random data for PageRank example

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: examples
    • Labels: None

    Description

      As stated in a comment on Whirr's JIRA:
      https://issues.apache.org/jira/browse/WHIRR-355

      We should generate a big file (1-5 GB?) for the PageRank example. We wanted to add this as part of the contrib, but we skipped/lost it somehow.
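
      A minimal sketch of what a random input generator could look like. The class name RandomGraphSketch, the tab-separated "vertex, then neighbors" line layout, and the parameters are assumptions made for illustration; they are not taken from the Hama example code. Scaling numVertices up and streaming the lines to a file instead of collecting them in memory would be needed to reach a file in the gigabyte range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

public class RandomGraphSketch {

    /**
     * Builds one text line per vertex: "vertex<TAB>neighbor<TAB>neighbor...".
     * The tab-separated layout is an assumption, not a format taken from the issue.
     */
    static List<String> generate(int numVertices, int maxOutDegree, long seed) {
        Random random = new Random(seed); // fixed seed keeps the output reproducible
        List<String> lines = new ArrayList<>();
        for (int vertex = 0; vertex < numVertices; vertex++) {
            // TreeSet drops duplicate edges and keeps the neighbors sorted.
            TreeSet<Integer> neighbors = new TreeSet<>();
            int outDegree = random.nextInt(maxOutDegree + 1);
            for (int i = 0; i < outDegree; i++) {
                int target = random.nextInt(numVertices);
                if (target != vertex) { // skip self-loops
                    neighbors.add(target);
                }
            }
            StringBuilder line = new StringBuilder(Integer.toString(vertex));
            for (int neighbor : neighbors) {
                line.append('\t').append(neighbor);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Small demo graph; real input would use far more vertices.
        for (String line : generate(5, 3, 42L)) {
            System.out.println(line);
        }
    }
}
```

      Because every vertex gets its own line, every link target in the generated data also appears as a source row, even when its neighbor list is empty.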

      I started crawling several pages, starting from Google News. But then my free Amazon EC2 quota expired and I had to stop the crawl.

      > We need some cloud to crawl
      > We need a place to make the data available

      The stuff we need is already coded here:
      http://code.google.com/p/hama-shortest-paths/source/browse/#svn%2Ftrunk%2Fhama-gsoc%2Fsrc%2Fde%2Fjungblut%2Fcrawl

      Afterwards, a MapReduce processing job in the subpackage "processing" has to be run on the output of the crawler. This job ensures that the adjacency matrix is valid.
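
      The issue does not say what "valid" means for the adjacency matrix. The sketch below assumes one plausible reading: duplicate links are removed, and every page that appears as a link target also gets its own row, with an empty link list if it was never crawled. It is a plain in-memory Java illustration, not the MapReduce job from the "processing" subpackage, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class AdjacencyRepairSketch {

    /**
     * Returns a copy of the crawled link map in which duplicate links are
     * dropped and every link target has its own (possibly empty) row.
     * This definition of "valid" is an assumption, not taken from the issue.
     */
    static Map<String, Set<String>> repair(Map<String, List<String>> crawled) {
        Map<String, Set<String>> repaired = new TreeMap<>();
        Set<String> allTargets = new LinkedHashSet<>();
        for (Map.Entry<String, List<String>> entry : crawled.entrySet()) {
            // A set removes duplicate outgoing links from the same page.
            Set<String> targets = new LinkedHashSet<>(entry.getValue());
            repaired.put(entry.getKey(), targets);
            allTargets.addAll(targets);
        }
        // Pages that were linked to but never crawled get an empty row.
        for (String target : allTargets) {
            repaired.putIfAbsent(target, new LinkedHashSet<>());
        }
        return repaired;
    }

    public static void main(String[] args) {
        Map<String, List<String>> crawled = new HashMap<>();
        crawled.put("a", Arrays.asList("b", "b", "c")); // duplicate link to b
        crawled.put("b", Arrays.asList("a"));           // c was never crawled
        for (Map.Entry<String, Set<String>> row : repair(crawled).entrySet()) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}
```

      In a MapReduce setting the same idea would typically be split across a map phase that emits each target as a key and a reduce phase that merges the rows, but that layout is also an assumption here.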

            People

              Assignee: Thomas Jungblut
              Reporter: Thomas Jungblut
              Votes: 1
              Watchers: 1