Description
Explicit support for checkpointing DataFrames is needed to truncate lineages, prune the query plan (particularly the logical plan), and recover transparently from failures.
While saving to a Parquet file may be sufficient for recovery, actually using that file as a checkpoint (and truncating the lineage) requires reading it back.
This is required to be able to use DataFrames in iterative scenarios like Streaming and ML, as well as to avoid expensive re-computations on executor failure when executing a complex chain of queries over very large datasets.
Issue Links
- duplicates: SPARK-17972 "Query planning slows down dramatically for large query plans even when sub-trees are cached" (Resolved)