Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-12330

Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 2.0.0
    • Component/s: Block Manager, Mesos
    • Labels:
      None
    • Target Version/s:

      Description

      A simple line count example can be launched as similar to

      SPARK_HOME=/mnt/tmp/spark MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics ./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 --conf spark.mesos.executor.memoryOverhead=2048 --conf spark.mesos.executor.home=/mnt/tmp/spark --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
      

      In the shell the following lines can be executed:

      val text_file = sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
      
      text_file.map(l => 1).sum
      

      which will result in

      res0: Double = 6001215.0
      

      for the TPCH 1GB dataset

      Unfortunately the blockmgr directory remains on the executor node after termination of the spark context.

      The log on the executor looks like this near the termination:

      I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 from slave(1)@172.19.67.30:5051
      I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process __http__(4)@172.19.67.30:58604
      I1215 02:12:31.190932 130721 process.cpp:2392] Resuming executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
      I1215 02:12:31.190958 130702 process.cpp:2392] Resuming __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
      I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
      I1215 02:12:31.190943 130727 process.cpp:2392] Resuming __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
      I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up __http__(4)@172.19.67.30:58604
      I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process (2)@172.19.67.30:58604
      I1215 02:12:31.191040 130702 process.cpp:2392] Resuming (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
      I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
      I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
      I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 02:12:36.191062016+00:00)
      I1215 02:12:31.191066 130720 process.cpp:2392] Resuming (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
      15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
      I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 02:12:31.240091136+00:00
      I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 02:12:31.240036096+00:00
      I1215 02:12:31.240183 130730 process.cpp:2392] Resuming reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
      I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 02:12:31.340212992+00:00)
      I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 02:12:34.247005952+00:00)
      15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
      15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called
      
      ....no more java logs
      

      If the shuffle fraction is NOT set to 0.0, and the data is allowed to stay in memory, then the following log can be seen at termination instead:

      I1215 01:19:16.247705 120052 process.cpp:566] Parsed message name 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.24:60016 from slave(1)@172.19.67.24:5051
      I1215 01:19:16.247745 120052 process.cpp:2382] Spawned process __http__(4)@172.19.67.24:60016
      I1215 01:19:16.247747 120034 process.cpp:2392] Resuming executor(1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247741952+00:00
      I1215 01:19:16.247758 120030 process.cpp:2392] Resuming __gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247755008+00:00
      I1215 01:19:16.247772 120034 exec.cpp:381] Executor asked to shutdown
      I1215 01:19:16.247772 120038 process.cpp:2392] Resuming __http__(4)@172.19.67.24:60016 at 2015-12-15 01:19:16.247767808+00:00
      I1215 01:19:16.247791 120038 process.cpp:2497] Cleaning up __http__(4)@172.19.67.24:60016
      I1215 01:19:16.247814 120034 process.cpp:2382] Spawned process (2)@172.19.67.24:60016
      I1215 01:19:16.247823 120038 process.cpp:2392] Resuming (2)@172.19.67.24:60016 at 2015-12-15 01:19:16.247822080+00:00
      I1215 01:19:16.247836 120038 exec.cpp:80] Scheduling shutdown of the executor
      I1215 01:19:16.247820 120039 process.cpp:2392] Resuming __gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247816960+00:00
      I1215 01:19:16.247843 120034 exec.cpp:396] Executor::shutdown took 17355ns
      I1215 01:19:16.247849 120038 clock.cpp:260] Created a timer for (2)@172.19.67.24:60016 in 5secs in the future (2015-12-15 01:19:21.247843072+00:00)
      I1215 01:19:16.247855 120049 process.cpp:2392] Resuming (1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247849984+00:00
      15/12/15 01:19:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
      15/12/15 01:19:16 INFO MemoryStore: MemoryStore cleared
      15/12/15 01:19:16 INFO BlockManager: BlockManager stopped
      15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
      15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
      15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
      15/12/15 01:19:16 INFO ShutdownHookManager: Shutdown hook called
      
      .... no more java logs
      

      Notice in the former case with dirty directories left over, there is a SIGTERM intercepted by the executor, and there are no log messages indicating disk cleaning occurred. In the later case, the shutdown hooks for memory and block managers are seen.

      This is using Mesos 0.25.0

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                apachespark Apache Spark
                Reporter:
                drcrallen Charles R Allen
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: