Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-4818

Join operation should use iterator/lazy evaluation

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.1
    • Fix Version/s: 1.1.2, 1.2.1, 1.3.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      The current implementation of the join operation does not use an iterator (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped values. In big data applications, these value collections can be very large. This causes the cartesian product of all co-grouped values for a specific key of both RDDs to be kept in memory during the flatMapValues operation, resulting in an O(size(pair._1)*size(pair._2)) memory consumption instead of O(1). Very large value collections will therefore cause "GC overhead limit exceeded" exceptions and fail the task, or at least slow down execution dramatically.

      PairRDDFunctions.scala
      //...
      def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
        this.cogroup(other, partitioner).flatMapValues( pair =>
          for (v <- pair._1; w <- pair._2) yield (v, w)
        )
      }
      //...
      

      Since cogroup returns an Iterable instance of an Array, the join implementation could be changed to the following, which uses lazy evaluation instead, and has almost no memory overhead:

      PairRDDFunctions.scala
      //...
      def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
        this.cogroup(other, partitioner).flatMapValues( pair =>
          for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
        )
      }
      //...
      

      Alternatively, if the current implementation is intentionally not using lazy evaluation for some reason, there could be a lazyJoin() method next to the original join implementation that utilizes lazy evaluation. This of course applies to other join operations as well.

      Thanks!

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                zsxwing Shixiong Zhu
                Reporter:
                johannes.simon Johannes Simon
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: