SPARK-8666

checkpointing does not take advantage of persisted/cached RDDs

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate

    Description

      I have noticed that when I checkpoint an RDD, every operation in its lineage runs TWICE.

      For example, when I run the following code and watch the stages:

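      import org.apache.spark.storage.StorageLevel

      // Persist before checkpointing so the checkpoint job can, in principle, reuse the cached partitions.
      val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
      newRDD.checkpoint()
      println(newRDD.count())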

      I see the distinct and count operations appear TWICE, and the shuffle disk writes and reads (from the distinct) also occur TWICE.

      My newRDD is already persisted (MEMORY_AND_DISK_SER), so why can't the checkpoint simply save those partitions to disk once the first operations have completed?
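One way to check whether the persisted partitions are really recomputed during checkpointing is to count how often a function in the lineage runs. The following is a minimal sketch, not a definitive diagnosis. It assumes Spark 2.0 or later (for `longAccumulator`), local mode, a temporary checkpoint directory, and a small synthetic `prevRDD`. The object name and the extra counting `map` after `distinct` are hypothetical instrumentation and are not part of the original report.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical driver object, for illustration only.
object CheckpointRecomputeCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads.
    val conf = new SparkConf()
      .setAppName("checkpoint-recompute-check")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Assumption: a throwaway local directory is acceptable for checkpoint files.
    sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    // Synthetic stand-in for prevRDD from the report.
    val prevRDD = sc.parallelize(Seq(("a", 10), ("b", 20), ("a", 30)), 2)

    // Incremented once per record each time the post-shuffle side is computed.
    val postShuffleCalls = sc.longAccumulator("postShuffleCalls")

    val newRDD = prevRDD
      .map(a => (a._1, 1L))
      .distinct()
      // Instrumentation only: counts records flowing into the persisted RDD.
      .map { kv => postShuffleCalls.add(1L); kv }
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    // checkpoint() must be called before the first action on this RDD.
    newRDD.checkpoint()

    val distinctKeys = newRDD.count()
    println(s"distinct keys        = $distinctKeys")
    println(s"post-shuffle records = ${postShuffleCalls.value}")
    println(s"isCheckpointed       = ${newRDD.isCheckpointed}")

    sc.stop()
  }
}
```

If the counter ends up at roughly twice the number of distinct keys, the checkpoint job recomputed the partitions rather than reading them from the cache. Accumulator updates made inside transformations are applied again whenever a task reruns, which is what makes them usable as a recomputation probe here.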

            People

              Assignee: Unassigned
              Reporter: Glenn Strycker (glenn.strycker@gmail.com)
              Votes: 0
              Watchers: 2
