Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-4896

PageRank algorithm for directed graphs

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.3.0
    • Component/s: Gelly
    • Labels:

      Description

      Gelly includes PageRank implementations for scatter-gather and gather-sum-apply. Both ship with the warning "The implementation assumes that each page has at least one incoming and one outgoing link."

      PageRank is a directed algorithm and sources and sinks are common in directed graphs.

      Sinks drain the total score across the graph which affects convergence and the balance of the random hop (convergence is not currently a feature of Gelly's PageRanks as this a very recent feature from FLINK-3888).

      Sources are handled nicely by the algorithm highlighted on Flink's features page under "Iterations and Delta Iterations" since score deltas are transmitted and a source's score never changes (is always equal to the random hop probability divided by the vertex count).
      https://flink.apache.org/features.html

      We should find an implementation featuring convergence and unrestricted processing of directed graphs and move other implementations to Gelly examples.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user greghogan opened a pull request:

          https://github.com/apache/flink/pull/2733

          FLINK-4896 [gelly] PageRank algorithm for directed graphs

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/greghogan/flink 4896_pagerank_algorithm_for_directed_graphs

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2733.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2733


          commit bddc6437c3202ee981f0eeeaa684e5ac4274d23a
          Author: Greg Hogan <code@greghogan.com>
          Date: 2016-10-24T20:14:16Z

          FLINK-4896 [gelly] PageRank algorithm for directed graphs


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/2733 FLINK-4896 [gelly] PageRank algorithm for directed graphs You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 4896_pagerank_algorithm_for_directed_graphs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2733.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2733 commit bddc6437c3202ee981f0eeeaa684e5ac4274d23a Author: Greg Hogan <code@greghogan.com> Date: 2016-10-24T20:14:16Z FLINK-4896 [gelly] PageRank algorithm for directed graphs
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user vasia commented on the issue:

          https://github.com/apache/flink/pull/2733

          Hi @greghogan,
          thanks for this PR. Do you have any idea how this implementation compares with the existing ones? I'm curious since it uses a bulk iteration and the existing ones are using delta iterations. Is the idea to keep this one as the library implementation and move the existing scatter-gather and gsa ones to the examples?
          Thanks!

          Show
          githubbot ASF GitHub Bot added a comment - Github user vasia commented on the issue: https://github.com/apache/flink/pull/2733 Hi @greghogan, thanks for this PR. Do you have any idea how this implementation compares with the existing ones? I'm curious since it uses a bulk iteration and the existing ones are using delta iterations. Is the idea to keep this one as the library implementation and move the existing scatter-gather and gsa ones to the examples? Thanks!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/2733

          Running on a c4.xlarge with 4 slots and a 4 GB preallocated TaskManager heap. EdgeList measures the time to simplify the graph since the library PageRank using Scatter-Gather ("PageRankSG") requires each vertex to have both incoming and outgoing edges. "PageRank" is the algorithm from this PR.

          Algorithm | Scale 16 | Scale 18
          ------------ | ------------- | -------------
          EdgeList | 2537 ms | 8779 ms
          PageRank | 9563 ms | 39558 ms
          PageRankSG | 11188 ms | 47736 ms

          Show
          githubbot ASF GitHub Bot added a comment - Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2733 Running on a c4.xlarge with 4 slots and a 4 GB preallocated TaskManager heap. EdgeList measures the time to simplify the graph since the library PageRank using Scatter-Gather ("PageRankSG") requires each vertex to have both incoming and outgoing edges. "PageRank" is the algorithm from this PR. Algorithm | Scale 16 | Scale 18 ------------ | ------------- | ------------- EdgeList | 2537 ms | 8779 ms PageRank | 9563 ms | 39558 ms PageRankSG | 11188 ms | 47736 ms
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/2733

          @vasia Running on a c4.xlarge with 4 slots and a 4 GB preallocated TaskManager heap. EdgeList measures the time to simplify the graph since the library PageRank using Scatter-Gather ("PageRankSG") and Gather-Sum-Apply ("PageRankGSA") require each vertex to have both incoming and outgoing edges. "PageRank" is the algorithm from this PR. Each execution is performing 10 iterations.

          Algorithm | Scale 16 | Scale 18 | Scale 20 | Scale 22
          ------------ | ------------- | ------------- | ------------- | -------------
          EdgeList | 2.537 s | 8.779 s | 34.105 s | 141.512 s
          PageRank | 9.563 s | 39.558 s | 168.401 s | 740.345 s
          PageRankSG | 11.188 s | 47.736 s | 216.688 s | 1041.176 s
          PageRankGSA | 14.001 s | 60.663 s | 268.568 s | 1241.344 s
          Speedup over SG | 23% | 26% | 35% | 50%

          I'm surprised that pageRankSG is performing faster than PageRankGSA.

          The current SG and GSA implementations would vw good examples, and already have integration tests.

          Show
          githubbot ASF GitHub Bot added a comment - Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2733 @vasia Running on a c4.xlarge with 4 slots and a 4 GB preallocated TaskManager heap. EdgeList measures the time to simplify the graph since the library PageRank using Scatter-Gather ("PageRankSG") and Gather-Sum-Apply ("PageRankGSA") require each vertex to have both incoming and outgoing edges. "PageRank" is the algorithm from this PR. Each execution is performing 10 iterations. Algorithm | Scale 16 | Scale 18 | Scale 20 | Scale 22 ------------ | ------------- | ------------- | ------------- | ------------- EdgeList | 2.537 s | 8.779 s | 34.105 s | 141.512 s PageRank | 9.563 s | 39.558 s | 168.401 s | 740.345 s PageRankSG | 11.188 s | 47.736 s | 216.688 s | 1041.176 s PageRankGSA | 14.001 s | 60.663 s | 268.568 s | 1241.344 s Speedup over SG | 23% | 26% | 35% | 50% I'm surprised that pageRankSG is performing faster than PageRankGSA. The current SG and GSA implementations would vw good examples, and already have integration tests.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/2733

          @vasia was there more you wanted to review here?

          Show
          githubbot ASF GitHub Bot added a comment - Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2733 @vasia was there more you wanted to review here?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user vasia commented on the issue:

          https://github.com/apache/flink/pull/2733

          Thanks for the results @greghogan! Updating the docs would be nice, otherwise +1.
          Do you think we should move existing implementations in the examples (there are still relevant to demonstrate iteration APIs) and keep this one as the only library method since it's the fastest and more general one?

          Show
          githubbot ASF GitHub Bot added a comment - Github user vasia commented on the issue: https://github.com/apache/flink/pull/2733 Thanks for the results @greghogan! Updating the docs would be nice, otherwise +1. Do you think we should move existing implementations in the examples (there are still relevant to demonstrate iteration APIs) and keep this one as the only library method since it's the fastest and more general one?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/2733

          @vasia moved SG and GSA implementations to examples and updated docs.

          Show
          githubbot ASF GitHub Bot added a comment - Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2733 @vasia moved SG and GSA implementations to examples and updated docs.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2733

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2733
          Hide
          greghogan Greg Hogan added a comment -

          Implemented in ea14053fe32280ffc36e586b5d3712c751fa1f84

          Show
          greghogan Greg Hogan added a comment - Implemented in ea14053fe32280ffc36e586b5d3712c751fa1f84

            People

            • Assignee:
              greghogan Greg Hogan
              Reporter:
              greghogan Greg Hogan
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development