Spark / SPARK-23880

Table cache should be lazy and not trigger any job


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      val df = spark.range(10000000000L)
        .filter('id > 1000)
        .orderBy('id.desc)
        .cache()
      

      This triggers a job, even though caching should be lazy. The problem is that when creating the `InMemoryRelation` we build the RDD, which calls `SparkPlan.execute` and may trigger jobs, such as the sampling job for a range partitioner or a broadcast job.

      We should defer RDD creation to the physical planning phase.
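      The fix amounts to making materialization lazy. A minimal sketch of the idea in plain Scala (this is not Spark's actual code; `Relation`, `plan`, and `jobsTriggered` are hypothetical stand-ins): with `lazy val`, the expensive "plan execution" runs on first access rather than at construction time, just as the cached RDD should be built at physical planning/first action rather than when `cache()` is called.

      ```scala
      // Hypothetical sketch: defer expensive work with `lazy val`,
      // analogous to building the cached RDD lazily instead of inside
      // the InMemoryRelation constructor.
      object LazyCacheSketch {
        var jobsTriggered = 0 // counts "jobs", i.e. plan executions

        // stands in for a relation over a query plan whose execution
        // would launch a Spark job
        class Relation(plan: () => Seq[Int]) {
          // lazy: the buffer is materialized on first access,
          // not when the Relation is constructed
          lazy val cachedBuffer: Seq[Int] = { jobsTriggered += 1; plan() }
        }

        def main(args: Array[String]): Unit = {
          // mirrors the filter + descending order-by in the report
          val rel = new Relation(() => (1 to 10).filter(_ > 3).sortBy(x => -x))
          println(jobsTriggered)         // 0: constructing (caching) ran nothing
          println(rel.cachedBuffer.head) // 10: first action materializes the buffer
          println(jobsTriggered)         // 1: the plan executed exactly once
        }
      }
      ```

      The same access pattern explains why the reported `cache()` call is surprising: building the RDD eagerly is like replacing the `lazy val` above with a plain `val`, which would bump `jobsTriggered` at construction time.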

      Attachments

        Activity

          People

            Assignee: maropu Takeshi Yamamuro
            Reporter: cloud_fan Wenchen Fan
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: