Description
Right now there are a bunch of different ways to cache tables in Spark SQL. For example:
- tweets.cache()
- sql("SELECT * FROM tweets").cache()
- table("tweets").cache()
- tweets.cache().registerTempTable("tweets")
- sql("CACHE TABLE tweets")
- cacheTable("tweets")
Each of the above commands has subtly different semantics, leading to a very confusing user experience. Ideally, we would stop keying the cache on simple table names and instead add an optimization phase that intelligently matches query plans against available cached data.
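The idea can be illustrated with a toy sketch. This is not Spark's actual implementation (and it is written in Python for brevity, while Spark SQL itself is Scala); the plan node classes, the `PlanCache` class, and the `InMemoryRelation` placeholder string are all invented for illustration. The point is only that the cache is keyed on the structure of the logical plan, so `table("tweets")` and `sql("SELECT * FROM tweets")` would both resolve to plans containing the same scan and hit the same cached data:

```python
# Toy sketch of plan-based caching (hypothetical, not Spark's code): the
# cache is keyed on the query plan itself rather than on a table name, and
# incoming plans are rewritten to reuse any cached subtree that matches.
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class TableScan:
    name: str

@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"

Plan = Union[TableScan, Project]

class PlanCache:
    def __init__(self):
        # Maps a logical plan to a placeholder standing in for cached
        # in-memory columnar data.
        self._cached = {}

    def cache(self, plan: Plan) -> None:
        # A real implementation would canonicalize the plan first, so that
        # semantically equal plans compare equal; here structural equality
        # of frozen dataclasses stands in for that.
        self._cached[plan] = f"InMemoryRelation({plan!r})"

    def use_cached(self, plan: Plan):
        # Substitute cached data for any matching subtree of the plan.
        if plan in self._cached:
            return self._cached[plan]
        if isinstance(plan, Project):
            return ("Project", plan.columns, self.use_cached(plan.child))
        return plan
```

With this scheme, caching happens once per plan shape rather than once per entry point: after `cache(TableScan("tweets"))`, a later `Project(("*",), TableScan("tweets"))` is rewritten to read from the cached relation, regardless of which API produced it.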
Issue Links
- contains
  - SPARK-3641 Correctly populate SparkPlan.currentContext (Resolved)
- is related to
  - SPARK-2189 Method for removing temp tables created by registerAsTable (Resolved)
  - SPARK-3298 [SQL] registerAsTable / registerTempTable overwrites old tables (Resolved)
- relates to
  - SPARK-1379 Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects. (Resolved)