  1. Spark
  2. SPARK-21815

Nondeterministic group labeling within small connected components

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Not A Bug
    • Affects Version/s: 1.6.3, 2.2.0
    • Fix Version/s: None
    • Component/s: GraphX
    • Labels:

      Description

      Looking at the code at https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L61, when the number of vertices in each community is small and the number of iterations is large enough, all candidate labels end up with the same score. The tie is then broken by the iteration order of the collection, so vertices in the same community can be assigned different community IDs. Breaking ties by ordering on vertexId solves the problem.

      Sample code to reproduce this error:
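      import org.apache.spark.graphx.{Edge, Graph}
      import org.apache.spark.graphx.lib.LabelPropagation
      import spark.implicits._  // needed for toDF outside spark-shell

      val vertices = spark.sparkContext.parallelize(Seq((1L, 1), (2L, 1)))
      val edges = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, 1)))
      val g = Graph(vertices, edges)
      // Run label propagation for 5 iterations
      val c = LabelPropagation.run(g, 5)
      c.vertices.map(x => (x._1, x._2)).toDF.show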
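
      The tie-breaking proposed in the description could look roughly like the sketch below. It is plain Scala with no Spark dependency, and it assumes the selection at the linked line picks the label with the highest count from a map of label to count. The object name TieBreakSketch and the helper pickLabel are hypothetical and not part of GraphX. The sketch only makes the choice among tied labels reproducible; it does not by itself show that label propagation converges on the graph above.

```scala
object TieBreakSketch {
  type VertexId = Long

  // Hypothetical helper, not GraphX API.
  // Picks the label with the highest count; when several labels share
  // the highest count, the smallest label id wins, so the result does
  // not depend on the iteration order of the map.
  def pickLabel(current: VertexId, message: Map[VertexId, Long]): VertexId =
    if (message.isEmpty) current
    else message.minBy { case (label, count) => (-count, label) }._1
}
```

      With this ordering, Map(2L -> 1L, 3L -> 1L) and Map(3L -> 1L, 2L -> 1L) both yield label 2, whereas a selection keyed on the count alone returns whichever tied entry is iterated first.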

            People

            • Assignee:
              Unassigned
              Reporter:
              tuan3w nguyen duc tuan
            • Votes:
              0
              Watchers:
              1
