Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.2.0
- Fix Version/s: None
- Component/s: None
Description
When running a bigger ALS training job, Spark spills loads of temporary data into the local dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing all disks on all nodes to run out of space. In my case I have 12 TB of disk capacity available before kicking off the ALS, but it all gets used, and YARN kills the containers when usage reaches 90%.
Even with all recommended options applied (configuring checkpointing and forcing GC when possible), the temporary data still doesn't get cleared.
Here is my (pseudo)code (pyspark):

from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS

sc.setCheckpointDir('/tmp')
# load and repartition the implicit-feedback dataset, keeping it in memory and spilling to disk
training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
# rank=50, 15 iterations, implicit preference weighting with alpha=40
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
# try to force cleanup of no-longer-referenced shuffle data
sc._jvm.System.gc()
The training RDD has about 3.5 billion items (~60 GB on disk). After about 6 hours the ALS job has consumed all 12 TB of local-dir disk space and gets killed. My cluster has 192 cores and 1.5 TB of RAM; for this task I am using 37 executors with 4 cores and 28+4 GB of RAM each.
The attached graph shows the disk consumption pattern: usage climbs from 7% to 90% during the ALS run (90% being the threshold at which YARN kills the containers).
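A possible mitigation, assuming a later Spark release where the DataFrame-based ALS exposes a checkpointInterval parameter (the feature tracked in SPARK-5955): checkpointing the factor RDDs every few iterations truncates the lineage so the old shuffle files can be removed (SPARK-6717). The sketch below is illustrative only; the input path, column names, and parameter values are assumptions, not taken from the original job.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-implicit").getOrCreate()
# checkpoint dir is required for checkpointInterval to have any effect
spark.sparkContext.setCheckpointDir("/tmp/als-checkpoints")

# hypothetical input with columns: user, item, count
ratings = spark.read.parquet("/tmp/dataset_parquet")

als = ALS(
    rank=50,
    maxIter=15,
    regParam=0.1,
    implicitPrefs=True,
    alpha=40.0,
    checkpointInterval=5,   # checkpoint factors every 5 iterations so shuffle files can be cleaned
    userCol="user",
    itemCol="item",
    ratingCol="count",
)
model = als.fit(ratings)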
Attachments
Issue Links
- Is contained by:
  - SPARK-5955 Add checkpointInterval to ALS (Resolved)
  - SPARK-6717 Clear shuffle files after checkpointing in ALS (Resolved)