The current implementation of the join operation does not use an iterator (lazy evaluation), so it eagerly evaluates the co-grouped values. In big data applications, these value collections can be very large. As a result, the Cartesian product of all co-grouped values for a given key of both RDDs is kept in memory during the flatMapValues operation, giving O(size(pair._1)*size(pair._2)) memory consumption instead of O(1), where pair._1 and pair._2 are the two co-grouped value collections for that key. Very large value collections will therefore cause "GC overhead limit exceeded" errors and fail the task, or at least slow down execution dramatically.
Since cogroup returns an Iterable instance of an Array, the join implementation could be changed to walk the two co-grouped value collections through their iterators, which evaluates lazily and has almost no memory overhead.
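To illustrate the difference, here is a minimal sketch that does not depend on Spark. It is not the actual join code: the object LazyJoinSketch and the methods eagerPairs and lazyPairs are hypothetical names, and plain Scala collections stand in for the two co-grouped value collections of a single key.

```scala
// Hypothetical stand-in for one co-grouped key: vs plays the role of pair._1
// and ws the role of pair._2. None of these names exist in Spark.
object LazyJoinSketch {

  // Eager variant: for strict collections the for-comprehension builds the
  // whole product, so size(vs) * size(ws) pairs exist in memory at once.
  def eagerPairs[V, W](vs: Iterable[V], ws: Iterable[W]): Iterable[(V, W)] =
    for (v <- vs; w <- ws) yield (v, w)

  // Lazy variant: going through .iterator yields an Iterator, so each pair
  // is only created when the consumer asks for the next element.
  def lazyPairs[V, W](vs: Iterable[V], ws: Iterable[W]): Iterator[(V, W)] =
    for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
}

object LazyJoinSketchDemo {
  def main(args: Array[String]): Unit = {
    // Both variants produce the same pairs in the same order.
    println(LazyJoinSketch.eagerPairs(Seq(1, 2), Seq("a", "b")).toList)
    println(LazyJoinSketch.lazyPairs(Seq(1, 2), Seq("a", "b")).toList)

    // The lazy variant can be consumed partially; the 10^12 pairs of this
    // product are never materialized.
    println(LazyJoinSketch.lazyPairs(1 to 1000000, 1 to 1000000).take(3).toList)
  }
}
```

In join itself, the same idea would presumably amount to calling .iterator on both collections inside the function passed to flatMapValues, assuming flatMapValues accepts an iterator as the function's result.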
Alternatively, if the current implementation intentionally avoids lazy evaluation for some reason, a lazyJoin() method that uses lazy evaluation could be added next to the original join implementation. The same applies to the other join operations.