Mahout
  1. Mahout
  2. MAHOUT-742

Pagerank implementation in Map/Reduce

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Graph
    • Labels:
      None

      Description

      Hi,

      my name is Christoph Nagel. I'm student on technical university Berlin and participating on the course of Isabel Drost and Sebastian Schelter.
      My work is to implement the pagerank-algorithm, where the pagerank-vector fits in memory.
      For the computation I used the naive algorithm shown in the book 'Mining of Massive Datasets' from Rajaraman & Ullman (http://www-scf.usc.edu/~csci572/2012Spring/UllmanMiningMassiveDataSets.pdf).
      Matrix- and vector-multiplication are done with mahout methods.

      Most work is the transformation the input graph, which has to consists of a nodes- and edges file.
      Format of nodes file: <node>\n
      Format of edges file: <startNode>\t<endNode>\n

      Therefore I created the following classes:

      • LineIndexer: assigns each line an index
      • EdgesToIndex: indexes the nodes of the edges
      • EdgesIndexToTransitionMatrix: creates the transition matrix
      • Pagerank: computes PR from transition matrix
      • JoinNodesWithPagerank: creates the joined output
      • PagerankExampleJob: does the complete job

      Each class has a test (not PagerankExampleJob) and I took the example of the book for evaluating.

      1. MAHOUT-742.patch
        53 kB
        Christoph Nagel

        Activity

        Hide
        Sebastian Schelter added a comment -

        Thank you very much for your great work, Christoph!

        Unfortunately we have to clarify whether we are legally allowed to include a pagerank implementation before we can commit your contribution...

        Show
        Sebastian Schelter added a comment - Thank you very much for your great work, Christoph! Unfortunately we have to clarify whether we are legally allowed to include a pagerank implementation before we can commit your contribution...
        Hide
        Sebastian Schelter added a comment -

        Patch refactored and committed. Thank you very much for your great work, Christoph!

        Show
        Sebastian Schelter added a comment - Patch refactored and committed. Thank you very much for your great work, Christoph!
        Hide
        Sebastian Schelter added a comment -

        I'd like to note here that this issue only covers the most straight-forward approach to computing pagerank in Map/Reduce and the code assumes that the pagerank vector and any row from the transition matrix fit into memory.

        There are more efficient ways to compute pagerank as described in http://portal.acm.org/citation.cfm?id=1830263 and blockwise matrix-vector mulitplication is necessary if the vectors don't fit into memory anymore.

        Nevertheless we should be very happy to have this contribution and I think it offers a good starting point for further improvements.

        Show
        Sebastian Schelter added a comment - I'd like to note here that this issue only covers the most straight-forward approach to computing pagerank in Map/Reduce and the code assumes that the pagerank vector and any row from the transition matrix fit into memory. There are more efficient ways to compute pagerank as described in http://portal.acm.org/citation.cfm?id=1830263 and blockwise matrix-vector mulitplication is necessary if the vectors don't fit into memory anymore. Nevertheless we should be very happy to have this contribution and I think it offers a good starting point for further improvements.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #933 (See https://builds.apache.org/job/Mahout-Quality/933/)
        MAHOUT-742 Pagerank implementation in Map/Reduce

        ssc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1145213
        Files :

        • /mahout/trunk/core/src/test/java/org/apache/mahout/graph/linkanalysis
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/linkanalysis/PageRankJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/linkanalysis
        • /mahout/trunk/core/src/test/java/org/apache/mahout/graph/GraphTestCase.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/common/GraphUtils.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/graph/linkanalysis/PageRankJobTest.java
        • /mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/graph/model/Edge.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/MathHelper.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #933 (See https://builds.apache.org/job/Mahout-Quality/933/ ) MAHOUT-742 Pagerank implementation in Map/Reduce ssc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1145213 Files : /mahout/trunk/core/src/test/java/org/apache/mahout/graph/linkanalysis /mahout/trunk/core/src/main/java/org/apache/mahout/graph/linkanalysis/PageRankJob.java /mahout/trunk/core/src/main/java/org/apache/mahout/graph/linkanalysis /mahout/trunk/core/src/test/java/org/apache/mahout/graph/GraphTestCase.java /mahout/trunk/core/src/main/java/org/apache/mahout/graph/common/GraphUtils.java /mahout/trunk/core/src/test/java/org/apache/mahout/graph/linkanalysis/PageRankJobTest.java /mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java /mahout/trunk/core/src/main/java/org/apache/mahout/graph/model/Edge.java /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/MathHelper.java
        Hide
        Nilesh Chakraborty added a comment - - edited

        Hey Sebastian Schelter!

        I've been working on an implementation of large-scale blockwise matrix-vector multiplication using a single MapReduce job. The current algorithms and implementations need two MapReduce jobs for blockwise multiplication (or any sparse mat-vec mult where the vector isn't stored in memory). I will be using it to implement PageRank. I'll benchmark my implementation against the state-of-the-art in MapReduce-based PageRank - Pegasus (they've contributed the Pegasus code to https://github.com/intel-hadoop/HiBench).

        If my version turns out to be faster, I'll be writing the code for algorithms like SVD and Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) too.

        Do you think these can make for a useful contribution to Mahout? I need to keep that in mind before I go forward with coding.

        Show
        Nilesh Chakraborty added a comment - - edited Hey Sebastian Schelter ! I've been working on an implementation of large-scale blockwise matrix-vector multiplication using a single MapReduce job. The current algorithms and implementations need two MapReduce jobs for blockwise multiplication (or any sparse mat-vec mult where the vector isn't stored in memory). I will be using it to implement PageRank. I'll benchmark my implementation against the state-of-the-art in MapReduce-based PageRank - Pegasus (they've contributed the Pegasus code to https://github.com/intel-hadoop/HiBench ). If my version turns out to be faster, I'll be writing the code for algorithms like SVD and Lanczos algorithm ( http://en.wikipedia.org/wiki/Lanczos_algorithm ) too. Do you think these can make for a useful contribution to Mahout? I need to keep that in mind before I go forward with coding.
        Hide
        Nilesh Chakraborty added a comment -

        Also, what do I need know about Mahout's policy for using 3rd party linear algebra libraries like MTJ, Colt etc.? Say if I need to borrow a lot of functionality from one such library, do I need to rewrite the code so as to eliminate any dependencies on such libraries? What about Apache Commons? I'd also appreciate it if you could give me some pointers/resources where such guidelines are detailed.

        Show
        Nilesh Chakraborty added a comment - Also, what do I need know about Mahout's policy for using 3rd party linear algebra libraries like MTJ, Colt etc.? Say if I need to borrow a lot of functionality from one such library, do I need to rewrite the code so as to eliminate any dependencies on such libraries? What about Apache Commons? I'd also appreciate it if you could give me some pointers/resources where such guidelines are detailed.
        Hide
        Sebastian Schelter added a comment -

        Mahout also contains a math library, first you should check whether it
        already contains what you need

        Show
        Sebastian Schelter added a comment - Mahout also contains a math library, first you should check whether it already contains what you need
        Hide
        Nilesh Chakraborty added a comment - - edited

        My bad, didn't know that Mahout org.apache.mahout.math.Matrix and her friends were so full-featured. Thanks. Then it shouldn't be any problem.

        Actually I had come across #MAHOUT-879 (Remove all graph algorithms with the exception of PageRank) and was just checking with you if large-scale sparse mat-vec mult and PageRank implementations in MapReduce are welcome.

        Show
        Nilesh Chakraborty added a comment - - edited My bad, didn't know that Mahout org.apache.mahout.math.Matrix and her friends were so full-featured. Thanks. Then it shouldn't be any problem. Actually I had come across # MAHOUT-879 (Remove all graph algorithms with the exception of PageRank) and was just checking with you if large-scale sparse mat-vec mult and PageRank implementations in MapReduce are welcome.

          People

          • Assignee:
            Sebastian Schelter
            Reporter:
            Christoph Nagel
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development