Spark / SPARK-3731

RDD caching stops working in PySpark after some time


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.0, 1.2.0
    • Fix Version/s: 1.1.1, 1.2.0
    • Component/s: PySpark, Spark Core
    • Labels: None
    • Environment: Linux, 32-bit, both in local mode and in standalone cluster mode

    Description

      Consider a file F which, when loaded with sc.textFile and cached, takes up slightly more than half of the memory available for the RDD cache.

      When the following is executed in PySpark:
      1) a = sc.textFile(F)
      2) a.cache().count()
      3) b = sc.textFile(F)
      4) b.cache().count()
      and then the following is repeated (for example 10 times):
      a) a.unpersist().cache().count()
      b) b.unpersist().cache().count()
      then after some time no RDDs remain cached in memory.
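
      Put together, the steps above amount to a small driver program. The following is a minimal sketch of such a reproduction (an illustration of the steps only, not the attached spark-3731.py; the master URL, application name, file path, and iteration count are placeholders):

          from pyspark import SparkContext

          sc = SparkContext("local", "SPARK-3731 repro")

          # F: a text file whose cached form takes up slightly more than half
          # of the memory available for the RDD cache (placeholder path).
          F = "/path/to/large_text_file"

          a = sc.textFile(F)
          a.cache().count()        # the first RDD gets cached
          b = sc.textFile(F)
          b.cache().count()        # caching b evicts (part of) a

          for _ in range(10):
              # Alternately drop and re-cache both RDDs. After a few
              # iterations nothing stays cached and the executors report
              # 0 MB of memory used.
              a.unpersist().cache().count()
              b.unpersist().cache().count()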

      Moreover, from that point on no other RDD ever gets cached: the worker keeps reporting something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even though rdd_23_5 is only ~50 MB. The Executors tab of the Application Detail UI shows all executors with 0 MB of memory used (which is consistent with the CacheManager warning).

      When doing the same in Scala, everything works as expected.

      I understand that this is a vague description, but I do not know how to describe the problem better.

      Attachments

        1. worker.log (84 kB, Milan Straka)
        2. spark-3731.txt.bz2 (0.2 kB, Milan Straka)
        3. spark-3731.py (0.2 kB, Milan Straka)
        4. spark-3731.log (25 kB, Milan Straka)


          People

            Assignee: Davies Liu (davies)
            Reporter: Milan Straka (straka)
            Votes: 0
            Watchers: 5
