Description
Explicit support for checkpointing DataFrames is needed to truncate lineages, prune the query plan (particularly the logical plan), and recover transparently from failures.
While saving to a Parquet file may be sufficient for recovery, actually using that file as a checkpoint (and truncating the lineage) requires reading it back.
This is required to be able to use DataFrames in iterative scenarios like Streaming and ML, as well as to avoid expensive re-computations on executor failure when executing a complex chain of queries over very large datasets.
Issue Links
- duplicates: SPARK-17972 "Query planning slows down dramatically for large query plans even when sub-trees are cached" (Resolved)