SPARK-33356

DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.2, 3.0.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: Reproducible locally with 3.0.1, 2.4.2, and latest master.

    Description

      The current implementation of the DAGScheduler exhibits exponential runtime in DAGs with many PartitionerAwareUnions. The reason seems to be a mutual recursion between PartitionerAwareUnionRDD.getPreferredLocations and DAGScheduler.getPreferredLocs.

      A minimal example reproducing the issue:

      import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

      object Example extends App {
        val partitioner = new HashPartitioner(2)
        val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]"))
        // 30 references to the same empty, pre-partitioned RDD
        val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner)
        val rdd2 = (1 to 30).map(_ => rdd1)
        // reduce builds a left-deep chain of 30 nested PartitionerAwareUnionRDDs,
        // since all inputs share the same partitioner
        val rdd3 = rdd2.reduce(_ union _)
        rdd3.collect()
      }
      

      The whole app should take around one second to complete, as no actual work is done. However, it takes more time to submit the job than I am willing to wait.

      The underlying cause appears to be mutual recursion between PartitionerAwareUnionRDD.getPreferredLocations and DAGScheduler.getPreferredLocs, which restarts the graph traversal at each PartitionerAwareUnionRDD with no memoization. As a result, each node of the DAG is visited exponentially many times instead of once.
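      To illustrate the blow-up, here is a self-contained toy model (not Spark code; the names and the exact structure are simplified assumptions on my part). A "scheduler" traversal, at every union node, first asks the union for its parents' locations and then, when that yields nothing, walks the parents again through the dependencies. With no memoization across these calls, the work roughly doubles with every nested union:

      object BlowupSketch extends App {
        // Toy DAG: Leaf stands in for the empty, pre-partitioned rdd1 and Union
        // for a PartitionerAwareUnionRDD produced by `_ union _`.
        sealed trait Node
        case object Leaf extends Node
        final case class Union(left: Node, right: Node) extends Node

        var visits = 0L

        // Stand-in for DAGScheduler.getPreferredLocs: every call starts a fresh,
        // unmemoized traversal.
        def prefLocs(node: Node): Seq[String] = {
          visits += 1
          node match {
            case Leaf => Seq.empty // empty RDD, no preferred locations
            case Union(l, r) =>
              // (1) the union asks the "scheduler" afresh for each parent
              //     (mirroring getPreferredLocations -> getPreferredLocs)
              val fromUnion = prefLocs(l) ++ prefLocs(r)
              // (2) if that yields nothing, the "scheduler" walks the parents
              //     again through the dependencies
              if (fromUnion.nonEmpty) fromUnion else prefLocs(l) ++ prefLocs(r)
          }
        }

        // Left-deep chains like the one built by rdd2.reduce(_ union _) above.
        for (n <- Seq(5, 10, 15, 20)) {
          visits = 0L
          val chain = (1 to n).foldLeft(Leaf: Node)((acc, _) => Union(acc, Leaf))
          prefLocs(chain)
          println(s"$n unions -> $visits visits") // roughly doubles per extra union
        }
      }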

      Note that it is clear to me that you could use sc.union(rdd2) instead of rdd2.reduce(_ union _) to eliminate the problem (see the snippet below). I use this construction only to demonstrate the issue in a sufficiently small example. Given a large DAG and many PartitionerAwareUnions, especially constructed by iterative algorithms, the problem can become relevant even without "abuse" of the union operation.
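      For reference, the workaround applied to the minimal example above (same sc and rdd2; rdd3Flat is just an illustrative name) would be:

        // One flat union over all 30 inputs instead of a chain of 30 nested
        // PartitionerAwareUnionRDDs.
        val rdd3Flat = sc.union(rdd2)
        rdd3Flat.collect()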

      The exponential recursion in the DAG scheduler was largely fixed with SPARK-682, but in the special case of PartitionerAwareUnion, it is still possible. This may actually be an underlying cause of SPARK-29181.

          People

            Assignee: Unassigned
            Reporter: Lucas Brutschy (lucasbru)
            Votes: 0
            Watchers: 1
