Details

Improvement

Status: Resolved

Major

Resolution: Won't Fix

1.0.2

None

None
Description
We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice.
Motivation
 Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges.
 For highdegree vertex, one vertex concentrates vast resources. So the workload on few highdegree vertex should be decomposed by all machines
 For lowdegree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on lowdegree vertex.
Algorithm Description
 HybridCut
 HybridCutPlus
 Greedy BiCut
 a heuristic algorithm for bipartite
Result
 The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance.
 This is an example of a balanced partitioning achieving 20% saving on communication.
 This is a simple partitioning result of BiCut.
 in2.01m is a generated power law graph with alpha equals 2.0
Code
 https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
 Because the implementation breaks the current separation with PartitionStrategy.scala, so need to think of a way to support access to graph.
Reference
 Bipartiteoriented Distributed Graph Partitioning for Big Learning.
 PowerLyraâ€¯: Differentiated Graph Computation and Partitioning on Skewed Graphs