Spark / SPARK-29319

Memory and disk usage not accurate when blocks are evicted and re-loaded in memory

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      I found this while running more targeted tests for the underlying behavior of this code, triggered by SPARK-27468. I ran this code:

      import java.util.Arrays
      import org.apache.spark.rdd._
      import org.apache.spark.storage._
      
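      // Builds a 64-partition RDD, persists it at the given level, and
      // materializes it so the blocks are actually cached.
      def newCachedRDD(level: StorageLevel): RDD[Array[Long]] = {
        val rdd = sc.parallelize(1 to 64, 64).map { i =>
          // One array of 1024 * 1024 longs (8 MiB) per partition, about 512 MiB per RDD.
          val a = new Array[Long](1024 * 1024)
          Arrays.fill(a, i)
          a
        }
        rdd.persist(level)
        rdd.count()
        rdd
      }
      
      // r1 can spill to disk; r2 is created second and evicts r1's in-memory blocks.
      val r1 = newCachedRDD(level = StorageLevel.MEMORY_AND_DISK)
      val r2 = newCachedRDD(level = StorageLevel.MEMORY_ONLY)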
      

      I ran it in a shell started with ./bin/spark-shell --master 'local-cluster[1,1,1024]' (one worker with one core and 1024 MB of memory).

      After it runs, you end up with the expected values: r1 is fully cached but only on disk, because all of its in-memory blocks were evicted by r2; r2 has as many blocks cached as memory can hold.
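
      A rough sizing sketch shows why the two RDDs cannot both stay in memory. The block and RDD sizes follow directly from the repro code (8 MiB per block, 512 MiB per RDD). The memory budget is an assumption: it uses Spark's defaults of 300 MiB reserved memory and spark.memory.fraction = 0.6, and treats the full 1024 MiB as usable heap, so the result is an upper bound rather than the exact figure the executor reports.

```scala
object CacheSizing {
  val MiB: Long = 1024L * 1024L

  // From the repro: each partition holds one Array[Long] of 1024 * 1024 elements.
  // 8 bytes per Long; the small array header is ignored.
  val bytesPerBlock: Long = 1024L * 1024L * 8L
  val blocksPerRdd: Int = 64
  val bytesPerRdd: Long = bytesPerBlock * blocksPerRdd

  // Assumptions (Spark defaults, not stated in the report):
  // 300 MiB reserved, spark.memory.fraction = 0.6, usable heap = 1024 MiB.
  val executorHeap: Long = 1024L * MiB
  val reserved: Long = 300L * MiB
  val memoryFraction: Double = 0.6
  val unifiedMemory: Long = ((executorHeap - reserved) * memoryFraction).toLong

  def main(args: Array[String]): Unit = {
    println(s"per block: ${bytesPerBlock / MiB} MiB")
    println(s"per RDD: ${bytesPerRdd / MiB} MiB")
    println(s"unified memory, upper bound: ${unifiedMemory / MiB} MiB")
    println(s"blocks that fit, upper bound: ${unifiedMemory / bytesPerBlock}")
  }
}
```

      Under these assumptions even a single RDD (512 MiB) exceeds the memory available for caching, which is consistent with r2 holding only part of its blocks and r1 ending up entirely on disk.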

      The problem shows up when you start playing with those RDDs again.
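
      A sketch for watching the driver's view from the shell while alternating between the two RDDs, instead of refreshing the UI. It assumes the same spark-shell session with r1 and r2 already defined. printStorage is a hypothetical helper; sc.getRDDStorageInfo is a developer API, and that it reports the same numbers as the storage page is an assumption that has not been verified here.

```scala
// Run in the same spark-shell session, after r1 and r2 have been created.
// printStorage is a hypothetical helper, not part of Spark.
def printStorage(label: String): Unit = {
  println(s"--- $label ---")
  // Only RDDs the driver currently considers cached are returned.
  sc.getRDDStorageInfo.sortBy(_.id).foreach { info =>
    val memMiB = info.memSize / (1024 * 1024)
    val diskMiB = info.diskSize / (1024 * 1024)
    println(s"RDD ${info.id}: cached=${info.numCachedPartitions}/${info.numPartitions} " +
      s"mem=$memMiB MiB disk=$diskMiB MiB")
  }
}

printStorage("initial")
for (round <- 1 to 3) {
  r1.count()
  printStorage(s"round $round, after r1.count()")
  r2.count()
  printStorage(s"round $round, after r2.count()")
}
```

      Comparing the printed snapshots across rounds makes it easy to see whether the reported sizes stay stable or drift.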

      Calling r1.count() will cause all of r2's blocks to be evicted, since r1's blocks are loaded back into memory. But no block update is sent to the driver about that load, so the driver does not know that the blocks are now in memory. The UI will show that r1 is using 0 bytes of memory, and r2 disappears from the storage page (the latter is expected).

      Calling r2.count() after that will cause r1's blocks to be evicted again. This will send updates to the driver, which will now double-count the disk usage. So if you keep doing this back and forth, r1's disk usage will keep growing in the UI, when in fact it doesn't change at all.
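
      The double counting can be illustrated with a toy model. This is not Spark's code: BlockUpdate, NaiveTracker and DeltaTracker are hypothetical names, and the report does not show the actual driver-side bookkeeping. The sketch only assumes what is described above, namely that each eviction from memory reports the block's on-disk size again while the reload in between sends nothing. It contrasts a tracker that adds every reported size with one that remembers the last known state per block and applies only the difference.

```scala
import scala.collection.mutable

// Toy model only: none of these types exist in Spark.
final case class BlockUpdate(blockId: String, memSize: Long, diskSize: Long)

// Adds every reported size to a running total, without remembering
// what was already counted for that block.
class NaiveTracker {
  var memUsed = 0L
  var diskUsed = 0L
  def onUpdate(u: BlockUpdate): Unit = {
    memUsed += u.memSize
    diskUsed += u.diskSize
  }
}

// Remembers the last known state per block and applies only the difference.
class DeltaTracker {
  private val last = mutable.Map.empty[String, BlockUpdate]
  var memUsed = 0L
  var diskUsed = 0L
  def onUpdate(u: BlockUpdate): Unit = {
    val prev = last.getOrElse(u.blockId, BlockUpdate(u.blockId, 0L, 0L))
    memUsed += u.memSize - prev.memSize
    diskUsed += u.diskSize - prev.diskSize
    last(u.blockId) = u
  }
}

object DoubleCountDemo {
  val blockSize: Long = 8L * 1024 * 1024

  // The block is written to disk once; each later eviction from memory
  // reports "on disk, 8 MiB" again. The reload in between sends nothing.
  val evictionReport: BlockUpdate =
    BlockUpdate("rdd_0_0", memSize = 0L, diskSize = blockSize)

  // Returns (disk total seen by the naive tracker, disk total seen by the delta tracker).
  def run(rounds: Int): (Long, Long) = {
    val naive = new NaiveTracker
    val delta = new DeltaTracker
    for (_ <- 1 to rounds) {
      naive.onUpdate(evictionReport)
      delta.onUpdate(evictionReport)
    }
    (naive.diskUsed, delta.diskUsed)
  }

  def main(args: Array[String]): Unit = {
    val (naiveDisk, deltaDisk) = run(rounds = 3)
    println(s"naive: $naiveDisk bytes, delta: $deltaDisk bytes")
  }
}
```

      In this model the naive total grows by one block size per round while the delta total stays at one block size. Tracking deltas would only address the disk figure; the memory figure stays stale as long as the reload into memory is never reported.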

          People

            Assignee: Unassigned
            Reporter: Marcelo Masiero Vanzin (vanzin)
            Votes: 0
            Watchers: 3
