Description
Right now there are a bunch of different ways to cache tables in Spark SQL. For example:
- tweets.cache()
- sql("SELECT * FROM tweets").cache()
- table("tweets").cache()
- tweets.cache().registerTempTable("tweets")
- sql("CACHE TABLE tweets")
- cacheTable("tweets")
Each of the above commands has subtly different semantics, leading to a very confusing user experience. Ideally, we would stop keying the cache on simple table names and instead add an optimization phase that intelligently matches query plans against available cached data.
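The idea can be illustrated with a toy sketch. This is not Spark's actual implementation (and it is written in Python for brevity, while Spark SQL itself is Scala); the plan node classes, the `PlanCache` class, and the `InMemoryRelation` placeholder string are all invented for illustration. The point is only that the cache is keyed on the structure of the logical plan, so `table("tweets")` and `sql("SELECT * FROM tweets")` would both resolve to plans containing the same scan and hit the same cached data:

```python
# Toy sketch of plan-based caching (hypothetical, not Spark's code): the
# cache is keyed on the query plan itself rather than on a table name, and
# incoming plans are rewritten to reuse any cached subtree that matches.
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class TableScan:
    name: str

@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"

Plan = Union[TableScan, Project]

class PlanCache:
    def __init__(self):
        # Maps a logical plan to a placeholder standing in for cached
        # in-memory columnar data.
        self._cached = {}

    def cache(self, plan: Plan) -> None:
        # A real implementation would canonicalize the plan first, so that
        # semantically equal plans compare equal; here structural equality
        # of frozen dataclasses stands in for that.
        self._cached[plan] = f"InMemoryRelation({plan!r})"

    def use_cached(self, plan: Plan):
        # Substitute cached data for any matching subtree of the plan.
        if plan in self._cached:
            return self._cached[plan]
        if isinstance(plan, Project):
            return ("Project", plan.columns, self.use_cached(plan.child))
        return plan
```

With this scheme, caching happens once per plan shape rather than once per entry point: after `cache(TableScan("tweets"))`, a later `Project(("*",), TableScan("tweets"))` is rewritten to read from the cached relation, regardless of which API produced it.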
Issue Links
- contains
  - SPARK-3641 Correctly populate SparkPlan.currentContext (Resolved)
- is related to
  - SPARK-2189 Method for removing temp tables created by registerAsTable (Resolved)
  - SPARK-3298 [SQL] registerAsTable / registerTempTable overwrites old tables (Resolved)
- relates to
  - SPARK-1379 Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects. (Resolved)