Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.5.1
- None
Description
A simple line-count example can be launched with something similar to:
SPARK_HOME=/mnt/tmp/spark MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics ./bin/spark-shell \
  --conf spark.mesos.coarse=true \
  --conf spark.cores.max=7 \
  --conf spark.mesos.executor.memoryOverhead=2048 \
  --conf spark.mesos.executor.home=/mnt/tmp/spark \
  --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' \
  --conf spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' \
  --conf spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' \
  --conf spark.executor.memory=7g \
  --conf spark.executorEnv.GLOG_v=9 \
  --conf spark.storage.memoryFraction=0.0 \
  --conf spark.shuffle.memoryFraction=0.0
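For reference, an equivalent standalone application could set the relevant options on a SparkConf instead of via --conf flags. This is only a minimal sketch; the master URL, app name, and paths below are placeholders rather than the exact values from the command above, and the credentials and GC flags are omitted:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical programmatic equivalent of the spark-shell invocation above
// (placeholder master URL and paths).
val conf = new SparkConf()
  .setMaster("mesos://zk://zk.example.com:2181/mesos")
  .setAppName("blockmgr-cleanup-repro")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "7")
  .set("spark.executor.memory", "7g")
  .set("spark.mesos.executor.memoryOverhead", "2048")
  .set("spark.mesos.executor.home", "/mnt/tmp/spark")
  .set("spark.storage.memoryFraction", "0.0")
  .set("spark.shuffle.memoryFraction", "0.0")
val sc = new SparkContext(conf)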
In the shell the following lines can be executed:
val text_file = sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
text_file.map(l => 1).sum
which will result in
res0: Double = 6001215.0
for the TPC-H 1 GB dataset.
Unfortunately, the blockmgr directory remains on the executor node after the Spark context terminates.
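One way to confirm the leftovers is to look for blockmgr-* directories on an executor host after the driver has exited. A rough sketch only; the directory root below is an assumption, so substitute whatever spark.local.dir (or the Mesos sandbox) points at on your hosts:

import java.io.File

// Hypothetical check, run on an executor host after the Spark context has
// terminated: list block-manager directories that were not cleaned up.
// "/tmp" is only the assumed default for spark.local.dir.
val localDirRoot = new File("/tmp")
val leftovers = Option(localDirRoot.listFiles).getOrElse(Array.empty[File])
  .filter(d => d.isDirectory && d.getName.startsWith("blockmgr-"))
leftovers.foreach(d => println(s"leftover: ${d.getAbsolutePath}"))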
The executor log near termination looks like this:
I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 from slave(1)@172.19.67.30:5051
I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process __http__(4)@172.19.67.30:58604
I1215 02:12:31.190932 130721 process.cpp:2392] Resuming executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
I1215 02:12:31.190958 130702 process.cpp:2392] Resuming __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
I1215 02:12:31.190943 130727 process.cpp:2392] Resuming __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up __http__(4)@172.19.67.30:58604
I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process (2)@172.19.67.30:58604
I1215 02:12:31.191040 130702 process.cpp:2392] Resuming (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 02:12:36.191062016+00:00)
I1215 02:12:31.191066 130720 process.cpp:2392] Resuming (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 02:12:31.240091136+00:00
I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 02:12:31.240036096+00:00
I1215 02:12:31.240183 130730 process.cpp:2392] Resuming reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 02:12:31.340212992+00:00)
I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 02:12:34.247005952+00:00)
15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called
....no more java logs
If the storage and shuffle memory fractions are NOT set to 0.0, and the data is allowed to stay in memory, then the following log is seen at termination instead:
I1215 01:19:16.247705 120052 process.cpp:566] Parsed message name 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.24:60016 from slave(1)@172.19.67.24:5051
I1215 01:19:16.247745 120052 process.cpp:2382] Spawned process __http__(4)@172.19.67.24:60016
I1215 01:19:16.247747 120034 process.cpp:2392] Resuming executor(1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247741952+00:00
I1215 01:19:16.247758 120030 process.cpp:2392] Resuming __gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247755008+00:00
I1215 01:19:16.247772 120034 exec.cpp:381] Executor asked to shutdown
I1215 01:19:16.247772 120038 process.cpp:2392] Resuming __http__(4)@172.19.67.24:60016 at 2015-12-15 01:19:16.247767808+00:00
I1215 01:19:16.247791 120038 process.cpp:2497] Cleaning up __http__(4)@172.19.67.24:60016
I1215 01:19:16.247814 120034 process.cpp:2382] Spawned process (2)@172.19.67.24:60016
I1215 01:19:16.247823 120038 process.cpp:2392] Resuming (2)@172.19.67.24:60016 at 2015-12-15 01:19:16.247822080+00:00
I1215 01:19:16.247836 120038 exec.cpp:80] Scheduling shutdown of the executor
I1215 01:19:16.247820 120039 process.cpp:2392] Resuming __gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247816960+00:00
I1215 01:19:16.247843 120034 exec.cpp:396] Executor::shutdown took 17355ns
I1215 01:19:16.247849 120038 clock.cpp:260] Created a timer for (2)@172.19.67.24:60016 in 5secs in the future (2015-12-15 01:19:21.247843072+00:00)
I1215 01:19:16.247855 120049 process.cpp:2392] Resuming (1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247849984+00:00
15/12/15 01:19:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
15/12/15 01:19:16 INFO MemoryStore: MemoryStore cleared
15/12/15 01:19:16 INFO BlockManager: BlockManager stopped
15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/12/15 01:19:16 INFO ShutdownHookManager: Shutdown hook called
.... no more java logs
Notice that in the former case, where dirty directories are left over, a SIGTERM is intercepted by the executor and there are no log messages indicating that disk cleanup occurred. In the latter case, the shutdown hooks for the memory and block managers are seen.
This is using Mesos 0.25.0.
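The logs above suggest that the on-disk cleanup is tied to JVM shutdown hooks: in the in-memory case the MemoryStore/BlockManager shutdown messages appear, while in the DISK_ONLY case the executor is SIGTERMed and no cleanup messages are logged before it exits. As an illustration of the general pattern only (a sketch, not Spark's actual implementation), directory cleanup registered as a shutdown hook looks like the following, and the directory is removed only if the hook actually runs to completion before the process dies:

import java.io.File
import java.nio.file.Files

object DiskCleanupSketch {
  // Recursively delete a directory tree.
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  def main(args: Array[String]): Unit = {
    // Scratch directory standing in for a blockmgr-* directory.
    val dir = Files.createTempDirectory("blockmgr-sketch-").toFile

    // Cleanup is deferred to a JVM shutdown hook. If the process is killed
    // before the hook runs (or the hook never deletes the directory), the
    // directory stays on disk, which is the behavior described in this report.
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = deleteRecursively(dir)
    }))

    println(s"writing blocks under ${dir.getAbsolutePath}")
    // ... write data under `dir` here ...
  }
}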
Attachments
Issue Links
- is depended upon by
  - SPARK-19111 S3 Mesos history upload fails silently if too large (Resolved)
- is related to
  - SPARK-12430 Temporary folders do not get deleted after Task completes causing problems with disk space. (Resolved)
- relates to
  - SPARK-1865 Improve behavior of cleanup of disk state (Resolved)
  - SPARK-6287 Add support for dynamic allocation in the Mesos coarse-grained scheduler (Closed)
- links to