[SPARK-1955] VertexRDD can incorrectly assume index sharing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.9.0, 0.9.1, 1.0.0
Fix Version/s: 1.2.2, 1.3.0
Component/s: GraphX
Labels: None

Description

Many VertexRDD operations (diff, leftJoin, innerJoin) can use a fast zip join if both operands are VertexRDDs sharing the same index (i.e., one operand is derived from the other). This check is implemented by matching on the operand type and using the fast join strategy if both are VertexRDDs.

This is clearly fine when both do in fact share the same index. It is also fine when the two VertexRDDs have the same partitioner but different indexes, because each VertexPartition will detect the index mismatch and fall back to the slow but correct local join strategy.

However, when they have different numbers of partitions or different partition functions, an exception or even silently incorrect results can occur.

For example:

import org.apache.spark._
import org.apache.spark.graphx._

// Construct VertexRDDs with different numbers of partitions
val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2)), 1))
val b = VertexRDD(sc.parallelize(List((0L, 5)), 8))
// Try to join them. Appears to work...
val c = a.innerJoin(b) { (vid, x, y) => x + y }
// ... but then fails with java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
c.collect

// Construct VertexRDDs with different partition functions
val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2))).partitionBy(new HashPartitioner(2)))
val bVerts = sc.parallelize(List((1L, 5)))
val b = VertexRDD(bVerts.partitionBy(new RangePartitioner(2, bVerts)))
// Try to join them. We expect (1L, 7).
val c = a.innerJoin(b) { (vid, x, y) => x + y }
// Silent failure: we get an empty set!
c.collect
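
The silent failure in the second example can be reproduced without Spark. The following is a toy model, not GraphX code: `HashLike`, `RangeLike`, `Dataset`, `zipJoin`, and `safeJoin` are hypothetical names introduced only for illustration, and the range bounds are chosen by hand rather than sampled as a real RangePartitioner would do. It sketches why zipping partition i with partition i loses the match when the two sides place vertex 1 in different partitions, and how co-partitioning first restores the expected result.

```scala
object ZipJoinModel {
  // Toy stand-ins for Spark partitioners; case-class equality plays the
  // role of the partitioner equality check.
  sealed trait ToyPartitioner {
    def numParts: Int
    def apply(id: Long): Int
  }
  final case class HashLike(numParts: Int) extends ToyPartitioner {
    def apply(id: Long): Int = (id % numParts).toInt // non-negative ids only
  }
  final case class RangeLike(upperBounds: Vector[Long]) extends ToyPartitioner {
    def numParts: Int = upperBounds.length + 1
    def apply(id: Long): Int = {
      val i = upperBounds.indexWhere(id <= _)
      if (i == -1) upperBounds.length else i
    }
  }

  // A partitioned dataset: one Map (vertex id -> value) per partition.
  final case class Dataset(partitioner: ToyPartitioner, parts: Vector[Map[Long, Int]])

  def build(data: Seq[(Long, Int)], p: ToyPartitioner): Dataset =
    Dataset(p, Vector.tabulate(p.numParts)(i => data.filter(kv => p(kv._1) == i).toMap))

  // Fast path: join partition i of `a` only against partition i of `b`.
  def zipJoin(a: Dataset, b: Dataset): Map[Long, Int] = {
    require(a.parts.length == b.parts.length, "unequal numbers of partitions")
    a.parts.zip(b.parts).flatMap { case (pa, pb) =>
      pa.toSeq.flatMap { case (id, x) => pb.get(id).map(y => id -> (x + y)) }
    }.toMap
  }

  // Guarded path: re-partition `b` with a's partitioner when they differ.
  def safeJoin(a: Dataset, b: Dataset): Map[Long, Int] = {
    val aligned =
      if (a.partitioner == b.partitioner) b
      else build(b.parts.flatten, a.partitioner)
    zipJoin(a, aligned)
  }

  val a: Dataset = build(Seq(0L -> 1, 1L -> 2), HashLike(2))   // vertex 1 lands in partition 1
  val b: Dataset = build(Seq(1L -> 5), RangeLike(Vector(1L)))  // vertex 1 lands in partition 0

  def main(args: Array[String]): Unit = {
    println(zipJoin(a, b))  // Map()
    println(safeJoin(a, b)) // Map(1 -> 7)
  }
}
```

In this model the unguarded zip never compares the two entries for vertex 1 because they sit at different partition indexes, which mirrors the empty result reported above.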

VertexRDD should check equality of partitioners before using the fast zip join. If the partitioners are different, the two datasets should be automatically co-partitioned.
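
On affected versions, a caller-side workaround might look like the sketch below. It assumes a spark-shell session and relies on the type match described above: when the second operand is not a VertexRDD, `innerJoin` takes the general path, which repartitions that operand to match the left side. `safeInnerJoin` is a hypothetical helper name, not part of GraphX, and this is not necessarily how the fix in the linked pull request is implemented.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical helper (not a GraphX API): only hand innerJoin a VertexRDD
// when both sides report the same partitioner.
def safeInnerJoin[A, B: ClassTag, C: ClassTag](
    a: VertexRDD[A], b: VertexRDD[B])(f: (VertexId, A, B) => C): VertexRDD[C] = {
  if (a.partitioner == b.partitioner) {
    // Same partitioner: the zip-based path is safe, and each partition
    // falls back to a local join if the indexes differ.
    a.innerJoin(b)(f)
  } else {
    // Different partitioners: re-expose b as a plain RDD so that the
    // type match inside innerJoin does not select the zip join.
    val plain: RDD[(VertexId, B)] = b.map { case (vid, attr) => (vid, attr) }
    a.innerJoin(plain)(f)
  }
}

// Usage with the operands from the examples above:
// val c = safeInnerJoin(a, b) { (vid, x, y) => x + y }
```

The cost of the second branch is a shuffle of `b`, which is the price of correctness when the two sides are not co-partitioned.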

Issue Links

is part of: SPARK-2365 Add IndexedRDD, an efficient updatable key-value store (Closed)

is related to: SPARK-5790 Add tests for: VertexRDD's won't zip properly for `diff` capability (Resolved)

links to: [Github] Pull Request #4705 (brennonyork)

People

Assignee: Brennon York

Reporter: Ankur Dave

Votes: 0

Watchers: 3

Dates

Created: 28/May/14 21:57

Updated: 25/Feb/15 22:15

Resolved: 25/Feb/15 22:15