Spark / SPARK-11879

Checkpoint support for DataFrame/Dataset

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Explicit support for checkpointing DataFrames is needed to be able to truncate lineage, prune the query plan (particularly the logical plan), and provide transparent failure recovery.

      While saving to a Parquet file may be sufficient for recovery, actually using that file as a checkpoint (and truncating the lineage) requires reading the files back.

      This is required to be able to use DataFrames in iterative scenarios such as Streaming and ML, as well as to avoid expensive re-computation on executor failure when executing a complex chain of queries over very large datasets.
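
      For illustration only, below is a minimal sketch of how this could look with the Dataset.checkpoint(eager) API available since Spark 2.1.0 (the fix version above), which saves the data to the checkpoint directory and returns a Dataset whose plan is truncated at the checkpoint. The checkpoint directory path, object name and iterative loop are hypothetical.

      {code:scala}
      import org.apache.spark.sql.{Dataset, Row, SparkSession}

      object CheckpointSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()

          // Checkpoint files should live on reliable storage (e.g. HDFS) so that
          // recovery still works after executor loss.
          spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

          // Hypothetical expensive input; any long chain of queries would do.
          var df: Dataset[Row] = spark.range(0L, 1000000L).toDF("id")

          // Iterative refinement (ML/streaming style). Without checkpointing the
          // logical plan keeps growing with every iteration.
          for (i <- 1 to 10) {
            df = df.filter(s"id % ${i + 1} != 0")
            if (i % 5 == 0) {
              // Eagerly materialises the data to the checkpoint directory and
              // returns a Dataset whose plan is truncated at the checkpoint,
              // keeping lineage and the logical plan small.
              df = df.checkpoint(eager = true)
            }
          }

          println(df.count())
          spark.stop()
        }
      }
      {code}

      Unlike persist(), which keeps blocks on executors and loses them on executor failure, checkpointed data is written to the reliable checkpoint directory and read back, which is what allows both lineage truncation and recovery.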


            People

              Assignee: Cheng Lian (lian cheng)
              Reporter: Cristian Opris (copris)
              Votes: 2
              Watchers: 12
