SPARK-30470

Uncache table in tempViews if needed on session closed


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

Description

Currently, Spark does not clean up cached tables behind temp views produced by SQL like the following:

`CACHE TABLE table1 AS SELECT ...`

There is a risk that `UNCACHE TABLE` is never called because the session was closed unexpectedly, or because the user closed it manually. The temp views are then lost and cannot be accessed from any other session, but the cached plans still exist in the `CacheManager`.
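
For illustration, here is a minimal sketch of the leak with two sessions sharing one `CacheManager`. The table name, the `range(10)` data, and the final `sharedState.cacheManager` check are assumptions for the example (the check goes through internal API and may not compile against every Spark version):

```scala
import org.apache.spark.sql.SparkSession

object CachedTempViewLeak {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cached-temp-view-leak")
      .getOrCreate()

    // Session 1: CACHE TABLE ... AS SELECT creates a session-scoped temp view
    // and registers its plan in the CacheManager shared by all sessions.
    val session1 = spark.newSession()
    session1.sql("CACHE TABLE table1 AS SELECT id FROM range(10)")

    // Session 1 goes away without running UNCACHE TABLE. A new session
    // cannot see the temp view any more...
    val session2 = spark.newSession()
    println(session2.catalog.tableExists("table1")) // false

    // ...but nothing removed the cached plan, so the entry is leaked in the
    // shared CacheManager (internal API, shown only to make the leak visible).
    println(spark.sharedState.cacheManager.isEmpty) // false

    spark.stop()
  }
}
```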

Moreover, the leaked entries may cause subsequent queries to fail. One failure we encountered in our production environment is shown below:

Caused by: java.io.FileNotFoundException: File does not exist: /user/xxxx/xx/data__db60e76d_91b8_42f3_909d_5c68692ecdd4
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.scan_nextBatch_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)

The above exception happens when the user has updated the data of the table, but Spark still uses the old cached plan, which points at files that no longer exist.
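
As the error message suggests, the immediate workaround is to invalidate the stale entry explicitly; a minimal sketch (the table name is hypothetical):

```scala
// Hypothetical recovery after hitting the FileNotFoundException:
// drop the stale cached entry and re-list the table's current files.
spark.sql("REFRESH TABLE table1")
// Equivalent call through the catalog API:
spark.catalog.refreshTable("table1")
```

Note that for a temp view whose owning session is already gone, there is no name left to refresh from another session, which is why this issue proposes uncaching such entries when the session closes.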


People

    • Assignee: Unassigned
    • Reporter: liupengcheng
    • Votes: 0
    • Watchers: 1
