Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-2411

Add basic graph summarization algorithm

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Minor
    • Resolution: Implemented
    • 0.10.0
    • 1.0.0
    • None

    Description

      Graph summarization determines a structural grouping of similar vertices and edges to condense a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used in OLAP-style operations on the graph and is similar to group by in SQL but on the graph structure instead of rows.

      The graph summarization operator represents every vertex group by a single vertex in the summarized graph; edges between vertices in the summary graph represent a group of edges between the vertex group members of the original graph. Summarization is defined by specifying grouping keys for vertices and edges, respectively.

      One publication that presents a Map/Reduce based approach is "Pagrol: Parallel graph olap over large-scale attributed graphs", however they pre-compute the graph-cube before it can be analyzed. With Flink, we can give the user an interactive way of summarizing the graph and do not need to compute the cube beforehand.
      A more complex approach focuses on summarization on graph patterns "SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization".

      However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally edge values and additionally stores a count aggregate at summarized vertices/edges.

      Consider the following two examples (e.g., social network with users from cities and friendships with timestamp):

      Input graph:

      Vertices (id, value):
      (0, Leipzig)
      (1, Leipzig)
      (2, Dresden)
      (3, Dresden)
      (4, Dresden)
      (5, Berlin)

      Edges (source, target, value):
      (0, 1, 2014)
      (1, 0, 2014)
      (1, 2, 2013)
      (2, 1, 2013)
      (2, 3, 2014)
      (3, 2, 2014)
      (4, 0, 2013)
      (4, 1, 2015)
      (5, 2, 2015)
      (5, 3, 2015)

      Output graph (summarized on vertex value):

      Vertices (id, value, count)
      (0, Leipzig, 2) // "2 users from Leipzig"
      (2, Dresden, 3) // "3 users from Dresden"
      (5, Berlin, 1) // "1 user from Berlin"

      Edges (source, target, count)
      (0, 0, 2) // "2 edges between users in Leipzig"
      (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
      (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
      (2, 2, 2) // "2 edges between users in Dresden"
      (5, 2, 2) // "2 edges from users in Berlin to users in Dresden"

      Output graph (summarized on vertex and edge value):

      Vertices (id, value, count)
      (0, Leipzig, 2)
      (2, Dresden, 3)
      (5, Berlin, 1)

      Edges (source, target, value, count)
      (0, 0, 2014, 2) // ...
      (0, 2, 2013, 1) // ...
      (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013"
      (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015"
      (2, 2, 2014, 2) // ...
      (5, 2, 2015, 2) // ...

      I've already implemented two versions of the summarization algorithm in our own project Gradoop, which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You can see the current WIP here:

      1 Abstract summarization
      2 Implementation using cross
      3 Implementation using joins
      4 Tests
      5 TestGraph

      I would basically use the same implementation as in 3 in combination with KeySelectors to select the grouping keys on vertices and edges.

      As you can see in the example, each vertex in the resulting graph has a vertex id that is contained in the original graph. This id is the smallest id among the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary to correctly assign the summarized edges. Maybe there is a smarter way to do it of which I did not think of yet.

      I would like to contribute this to Flink and of course, if you have any suggestions/improvements or do not want this at all (hopefully not), please let me know.

      Attachments

        Activity

          People

            mju Martin Junghanns
            mju Martin Junghanns
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: