SPARK-8582: Optimize checkpointing to avoid computing an RDD twice


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 1.0.0
    • Fix Version: 3.3.0
    • Component: Spark Core

    Description

      In Spark, checkpointing lets the user truncate an RDD's lineage and save its intermediate contents to HDFS for fault tolerance. However, the current implementation is not very efficient:

      Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once during the checkpoint itself, when we iterate through the RDD's partitions and write them to disk. See this line for more detail: https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.

      Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This would speed up many usages of `RDD#checkpoint` by roughly 2x.
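      The write-while-iterating idea can be sketched in plain Scala. This is a hypothetical illustration, not Spark's actual class: it wraps the iterator that computes a partition and writes each element out as the action pulls it, so the data is materialized only once (a `StringWriter` stands in for an HDFS stream).

```scala
import java.io.{BufferedWriter, StringWriter}

// Hypothetical sketch: tee elements to a checkpoint writer as they are consumed.
class CheckpointingIterator[T](underlying: Iterator[T], writer: BufferedWriter)
    extends Iterator[T] {
  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) writer.flush() // input exhausted: finalize the checkpoint stream
    more
  }
  override def next(): T = {
    val elem = underlying.next()
    writer.write(elem.toString) // persist each element as the action consumes it
    writer.newLine()
    elem
  }
}

object Demo extends App {
  val sink = new StringWriter() // stands in for an HDFS output stream
  val checkpointed =
    new CheckpointingIterator(Iterator(1, 2, 3), new BufferedWriter(sink))
  val sum = checkpointed.sum // a single pass both computes and checkpoints
  println(s"sum=$sum, lines written=${sink.toString.split("\n").length}")
}
```

      With this shape, the checkpoint write piggybacks on the action's single pass over the partition instead of requiring a second job.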

      (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.)
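      The caching workaround looks like the following minimal sketch (assumes a live `SparkContext` in `sc`; the input path is hypothetical):

```scala
// Workaround: cache before checkpointing so the separate checkpoint job
// reads cached partitions instead of recomputing the whole lineage.
val rdd = sc.textFile("hdfs:///some/input").map(_.length) // hypothetical path
rdd.cache()      // keep computed partitions in memory after the first action
rdd.checkpoint() // only marks the RDD; the write happens after the next action
val n = rdd.count() // computes once; the checkpoint job then reads the cache
```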


People

    Assignee: Shixiong Zhu (zsxwing)
    Reporter: Andrew Or (andrewor14)
    Votes: 16
    Watchers: 31

Dates

    Created:
    Updated:
    Resolved: