[SPARK-23880] table cache should be lazy and don't trigger any job - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.4.0
Component/s: SQL
Labels: None

Description

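// Needed for the 'id symbol syntax (imported automatically in spark-shell).
import spark.implicits._

val df = spark.range(10000000000L)
  .filter('id > 1000)
  .orderBy('id.desc)
  .cache()  // expected to be lazy, but this call already triggers a job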

This triggers a job, although the cache should be lazy. The problem is that creating `InMemoryRelation` builds the RDD eagerly: building it calls `SparkPlan.execute`, which may trigger jobs such as the sampling job for the range partitioner or a broadcast job.

We should instead create the RDD in the physical phase, so that calling `cache()` by itself does not run any job.
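
To illustrate the eager versus deferred behaviour described above, here is a minimal sketch in plain Scala. It is a toy model, not Spark's code: `LazyCacheSketch`, `execute`, `EagerRelation`, `LazyRelation`, and `jobsTriggered` are hypothetical names invented for this example, and `execute` only stands in for a call like `SparkPlan.execute` that may launch a job. Deferral is shown with `lazy val`, which is one way to postpone the work and is not necessarily how the actual fix does it.

```scala
// Toy model only: these are NOT Spark classes.
object LazyCacheSketch {
  // Counts how many times the pretend "job" has run.
  var jobsTriggered: Int = 0

  // Hypothetical stand-in for SparkPlan.execute: calling it "runs a job".
  def execute(): Seq[Long] = {
    jobsTriggered += 1
    Seq(3L, 2L, 1L)
  }

  // Eager variant: the data is built as soon as the relation is constructed,
  // so merely creating the relation runs a job.
  final class EagerRelation {
    val cachedRows: Seq[Long] = execute()
  }

  // Deferred variant: the data is built on first access only,
  // so creating the relation runs nothing.
  final class LazyRelation {
    lazy val cachedRows: Seq[Long] = execute()
  }
}
```

With this model, constructing a `LazyRelation` leaves `jobsTriggered` unchanged until `cachedRows` is first read, and later reads reuse the result. Constructing an `EagerRelation` increments the counter immediately, which mirrors the behaviour reported here.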

Issue Links

links to

[Github] Pull Request #21018 (maropu)

People

Assignee: Takeshi Yamamuro

Reporter: Wenchen Fan

Votes: 0

Watchers: 4

Dates

Created: 06/Apr/18 06:24

Updated: 09/Sep/18 04:03

Resolved: 25/Apr/18 11:06