Spark / SPARK-3731

RDD caching stops working in PySpark after some time


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.0, 1.2.0
    • Fix Version/s: 1.1.1, 1.2.0
    • Component/s: PySpark, Spark Core
    • Labels: None
    • Environment: Linux, 32-bit, both in local mode and in standalone cluster mode

    Description

      Consider a file F which, when loaded with sc.textFile and cached, takes up slightly more than half of the memory available for the RDD cache.

      When the following is executed in PySpark:
      1) a = sc.textFile(F)
      2) a.cache().count()
      3) b = sc.textFile(F)
      4) b.cache().count()
      and then the following is repeated (for example 10 times):
      a) a.unpersist().cache().count()
      b) b.unpersist().cache().count()
      then after some time no RDDs remain cached in memory.
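
      Put together, the steps above amount to a small driver program. The following is a minimal sketch of such a reproduction (an illustration of the steps only, not the attached spark-3731.py; the master URL, application name, file path, and iteration count are placeholders):

          from pyspark import SparkContext

          sc = SparkContext("local", "SPARK-3731 repro")

          # F: a text file whose cached form takes up slightly more than half
          # of the memory available for the RDD cache (placeholder path).
          F = "/path/to/large_text_file"

          a = sc.textFile(F)
          a.cache().count()        # the first RDD gets cached
          b = sc.textFile(F)
          b.cache().count()        # caching b evicts (part of) a

          for _ in range(10):
              # Alternately drop and re-cache both RDDs. After a few
              # iterations nothing stays cached and the executors report
              # 0 MB of memory used.
              a.unpersist().cache().count()
              b.unpersist().cache().count()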

      Moreover, from that point on no other RDD ever gets cached: the worker keeps reporting something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even though rdd_23_5 is only ~50 MB. The Executors tab of the Application Detail UI shows all executors with 0 MB of memory used (which is consistent with the CacheManager warning).

      When doing the same in Scala, everything works as expected.

      I understand that this is a vague description, but I do not know how to describe the problem better.

      Attachments

        1. worker.log (84 kB, Milan Straka)
        2. spark-3731.txt.bz2 (0.2 kB, Milan Straka)
        3. spark-3731.py (0.2 kB, Milan Straka)
        4. spark-3731.log (25 kB, Milan Straka)


          People

            Assignee: Davies Liu (davies)
            Reporter: Milan Straka (straka)
            Votes: 0
            Watchers: 5
