
HAMA-420: Generate random data for Pagerank example

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: examples
    • Labels: None

      Description

      As stated in a comment on the Whirr JIRA:
      https://issues.apache.org/jira/browse/WHIRR-355

      We should generate a big file (1-5 GB?) for the PageRank example. We wanted to add this as part of the contrib, but we skipped/lost it somehow.

      I started crawling several pages, starting from Google News. But then my free Amazon EC2 quota expired and I had to stop the crawl.

      > We need some cloud to crawl
      > We need a place to make the data available

      The stuff we need is already coded here:
      http://code.google.com/p/hama-shortest-paths/source/browse/#svn%2Ftrunk%2Fhama-gsoc%2Fsrc%2Fde%2Fjungblut%2Fcrawl

      Afterwards, a MapReduce processing job in the subpackage "processing" has to be run on the output of the crawler. This job makes sure the adjacency matrix is valid.
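
      The actual crawler and processing code live in the repository linked above. Purely as an illustration of what "making the adjacency matrix valid" can involve, here is a minimal MapReduce sketch that gives every URL appearing as an outlink its own adjacency-list line; the tab-separated line format and all class names are assumptions for this sketch, not the code from the repository.

      {code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch only: assumes crawler output lines of the form
// "source<TAB>target1<TAB>target2...". Not the code from hama-shortest-paths.
public class AdjacencyListCompleter {

  public static class UrlMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().split("\t");
      if (tokens.length == 0 || tokens[0].isEmpty()) {
        return;
      }
      StringBuilder outlinks = new StringBuilder();
      for (int i = 1; i < tokens.length; i++) {
        outlinks.append(tokens[i]).append('\t');
        // Emit every target with an empty edge list so it becomes a key too.
        context.write(new Text(tokens[i]), new Text(""));
      }
      context.write(new Text(tokens[0]), new Text(outlinks.toString().trim()));
    }
  }

  public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder merged = new StringBuilder();
      for (Text value : values) {
        if (value.getLength() > 0) {
          merged.append(value.toString()).append('\t');
        }
      }
      // One line per vertex, even for vertices that were only ever targets.
      context.write(key, new Text(merged.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "adjacency list completer");
    job.setJarByClass(AdjacencyListCompleter.class);
    job.setMapperClass(UrlMapper.class);
    job.setReducerClass(MergeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
      {code}

      After such a pass every URL appears exactly once as a key, even if it has no outlinks of its own.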


          Activity

          Thomas Jungblut added a comment -

          Okay, I started the crawler once again on Amazon.

          Going at 50 sites/s, it will be finished soon.

          @Edward, would you please take a look at the patch in WHIRR-355? Thanks.

          Thomas Jungblut added a comment -

          Crawl + Processing finished.

          This is a first version of the file. It had a lot of junk in it, so it is a very tiny snapshot.

          http://hama-shortest-paths.googlecode.com/svn/trunk/hama-gsoc/files/pagerank/import/crawled/crawled.txt

          (1.4 MB).

          I started another crawl which should yield a better result.
          But this output txt works fine with PageRank.

          I'll update it later then...

          Thomas Jungblut added a comment -

          Started again on c1.xlarge with 1,000,000 sites.

          This is a lot faster than the free tier, but after that I'm going to be a poor person.

          Edward J. Yoon added a comment -

          I propose that instead of crawling from the web we generate random graph data. What do you think?
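
          For what it's worth, a random adjacency list in the same tab-separated spirit could be produced with something as small as the sketch below. The vertex count, out-degree bound, output file name, and line format are made-up parameters for illustration, not anything we have agreed on:

          {code:java}
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: generate "vertex<TAB>neighbor<TAB>neighbor..." lines
// with random edges instead of crawled ones.
public class RandomGraphGenerator {

  public static void main(String[] args) throws IOException {
    int numVertices = 1000000; // tune to hit the desired file size
    int maxOutDegree = 40;     // assumed upper bound on outlinks per vertex
    Random random = new Random();

    try (BufferedWriter writer = new BufferedWriter(
        new FileWriter("pagerank-random-input.txt"))) {
      for (int vertex = 0; vertex < numVertices; vertex++) {
        int outDegree = random.nextInt(maxOutDegree) + 1;
        Set<Integer> neighbors = new HashSet<Integer>();
        while (neighbors.size() < outDegree) {
          int target = random.nextInt(numVertices);
          if (target != vertex) { // no self-loops
            neighbors.add(target);
          }
        }
        StringBuilder line = new StringBuilder(String.valueOf(vertex));
        for (int neighbor : neighbors) {
          line.append('\t').append(neighbor);
        }
        writer.write(line.toString());
        writer.newLine(); // one vertex per line
      }
    }
  }
}
          {code}

          Uniform random edges don't have the degree distribution of a real web graph, so crawled data still makes the nicer demo, but a generator like this would be enough to exercise the example.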

          Thomas Jungblut added a comment - edited

          This would be a lot less stressful. That is a good idea.

          EDIT: is there some (downloadable) dataset of URLs on the internet?

          ChiaHung Lin added a comment -

          Not very sure if this is what you are looking for. The Open Directory Project provides an RDF dump at http://rdf.dmoz.org/rdf/content.rdf.u8.gz.

          Thomas Jungblut added a comment -

          Oh yes. That is the dataset Nutch is using, right?
          Thanks, ChiaHung!

          ChiaHung Lin added a comment -

          Exactly. : ) And its size is a bit huge, so probably a subset of the content is needed if this is what we are looking for.

          Thomas Jungblut added a comment -

          This file drove me nuts.

          The XML was totally malformed and contained some characters that are not allowed; even the DMOZ parser of Nutch failed on it.

          However, I extracted the URLs via regex (see the sketch at the end of this comment).
          This resulted in 7,133,283 vertices, which is quite cool. But they contained lots of duplicate hosts, so I decided to deduplicate them, leaving about 2,440,000 vertices, which is enough.

          Final statistics:

          NumOfVertices: 2,442,507
          EdgeCounter: 32,282,149
          Size: ~680 MB (682,624,440 bytes)

          The partitioning takes a bit of time (way too long), but I'm working on an MR job that should parallelize this task.

          Funny fail of this evening:
          I forgot to write newlines while creating the file, so the partitioner filled up the memory because BufferedReader.readLine() never returned. Time to go to bed.

          Anyone want to host this file?
          Otherwise I'm going to put it up on the Google Code repository along with the list for the SSSP example.

          I have not tested it yet, so the file can differ later.
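
          For the record, the regex extraction plus per-host deduplication described above boils down to roughly the following sketch; the pattern, the class name, and the assumption that the dump has already been gunzipped are illustrative, not the exact code that produced the file:

          {code:java}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull URLs out of the (malformed) DMOZ dump with a
// regex and keep only the first URL seen for each host.
public class DmozUrlExtractor {

  // Simplified pattern; the real dump may need a more forgiving expression.
  private static final Pattern URL_PATTERN =
      Pattern.compile("https?://[^\"'<>\\s]+");

  public static void main(String[] args) throws IOException {
    Set<String> seenHosts = new HashSet<String>();
    try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = reader.readLine()) != null) {
        Matcher matcher = URL_PATTERN.matcher(line);
        while (matcher.find()) {
          String url = matcher.group();
          String host;
          try {
            host = URI.create(url).getHost();
          } catch (IllegalArgumentException e) {
            continue; // skip junk that is not even a valid URI
          }
          if (host != null && seenHosts.add(host)) {
            System.out.println(url); // first URL for this host wins
          }
        }
      }
    }
  }
}
          {code}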

          Thomas Jungblut added a comment -

          The file creator mixed up the order, so I had to regenerate a new file which has basically the same properties.
          I'm going to upload it now.

          I'll continue working on HAMA-423. I've written a whole new partitioner and refactored the examples to use it. It is actually much faster than the old one. The MR job I mentioned in the last post is already written too, but I don't want to add a JobTracker dependency to Hama.
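
          The new partitioner itself belongs to HAMA-423; the underlying idea of assigning a vertex line to one of N partitions by hashing is just the usual modulo trick, sketched below with a hypothetical helper (not the HAMA-423 code):

          {code:java}
// Hypothetical illustration of hash partitioning, not the HAMA-423 code:
// map a vertex id to one of numPartitions output files/BSP tasks.
public class HashPartitionSketch {

  public static int partitionFor(String vertexId, int numPartitions) {
    // Mask the sign bit instead of Math.abs, which overflows for Integer.MIN_VALUE.
    return (vertexId.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
          {code}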

          Thomas Jungblut added a comment -

          File can be found here:
          http://hama-shortest-paths.googlecode.com/svn/trunk/hama-gsoc/files/pagerank/input/pagerankAdjacencylist.txt

            People

            • Assignee: Thomas Jungblut
            • Reporter: Thomas Jungblut
            • Votes: 1
            • Watchers: 1
