Spark / SPARK-19111

S3 Mesos history upload fails silently if too large


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: EC2, Mesos, Spark Core

    Description

      2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped Spark web UI at http://REDACTED:4041
      2017-01-06T21:32:32,938 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.jvmGCTime
      2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.localBlocksFetched
      2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.resultSerializationTime
      2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(
      364,WrappedArray())
      2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.resultSize
      2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.peakExecutionMemory
      2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.fetchWaitTime
      2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.memoryBytesSpilled
      2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.remoteBytesRead
      2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.diskBytesSpilled
      2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.localBytesRead
      2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.recordsRead
      2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.executorDeserializeTime
      2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
      2017-01-06T21:32:32,941 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.executorRunTime
      2017-01-06T21:32:32,941 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.remoteBlocksFetched
      2017-01-06T21:32:32,943 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' closed. Now beginning upload
      2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
      2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
      2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
      

      Running Spark on Mesos, some large jobs fail to upload their event logs to the history server's S3 storage, with no error reported.

      A successful run shows the following sequence of log events, ending in a completed upload:

      2017-01-06T19:14:32,925 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
      2017-01-06T21:59:14,789 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' closed. Now beginning upload
      2017-01-06T21:59:44,679 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' upload complete
      

      Large jobs, however, never reach the "upload complete" log message; the process exits before the upload finishes, so the event log silently never appears in the history server storage.
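      The log lines above suggest why this fails silently: the S3 output stream buffers everything to a local temp file (`/mnt/tmp/hadoop/output-*.tmp`) and only starts the real upload when the stream is closed, so a driver that shuts down mid-upload loses the file without any error. The sketch below is illustrative only (it is not Hadoop's actual `NativeS3FileSystem` code, and the class and file names are made up); it shows the buffer-then-upload-on-close pattern and why nothing reaches the destination until `close()` returns:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of an upload-on-close stream: all writes land in a
// local temp file, and the whole "upload" (here, a local copy standing in
// for the S3 PUT) happens inside close(). If the JVM exits before close()
// completes, the destination never sees the data and no error is logged.
class BufferThenUploadStream extends OutputStream {
    private final Path tempFile;
    private final Path destination; // stands in for the S3 key
    private final OutputStream buffer;

    BufferThenUploadStream(Path destination) throws IOException {
        this.destination = destination;
        this.tempFile = Files.createTempFile("output-", ".tmp");
        this.buffer = Files.newOutputStream(tempFile);
    }

    @Override
    public void write(int b) throws IOException {
        buffer.write(b); // data goes to local disk only, not to S3
    }

    @Override
    public void close() throws IOException {
        buffer.close();
        // The entire upload happens here; for a multi-gigabyte event log
        // this can take minutes, and a shutdown mid-copy loses the file.
        Files.copy(tempFile, destination, StandardCopyOption.REPLACE_EXISTING);
        Files.delete(tempFile);
    }
}

public class UploadOnClose {
    public static void main(String[] args) throws IOException {
        Path dest = Files.createTempDirectory("s3-mock").resolve("eventLog");
        try (OutputStream out = new BufferThenUploadStream(dest)) {
            out.write("event data".getBytes());
            // At this point dest does not exist yet: nothing is uploaded
            // until close() runs at the end of this try-with-resources.
        }
        System.out.println(Files.exists(dest));
    }
}
```

      Under this pattern, the only evidence of a lost upload is the absence of the "upload complete" line, which matches the truncated log sequence reported above.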

Attachments

Issue Links

Activity

People

    Assignee: Unassigned
    Reporter: drcrallen (Charles R Allen)
    Votes: 0
    Watchers: 3

Dates

    Created:
    Updated:
    Resolved: