SPARK-23943

Improve observability of MesosRestServer/MesosClusterDispatcher


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1, 2.3.0
    • Fix Version/s: None
    • Component/s: Deploy, Mesos

       

    Description

      Two changes in this PR:

      • A /health endpoint that gives a quick binary indication of the health of MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a Marathon app: http://mesosphere.github.io/marathon/docs/health-checks.html. Returns a 200 status if the server is healthy and a 503 if it is unhealthy.
      • A /status endpoint for a more detailed examination of the current state of a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool.

      For both endpoints, regardless of status code, the following body is returned:

       

      {
        "action" : "ServerStatusResponse",
        "launchedDrivers" : 0,
        "message" : "iamok",
        "queuedDrivers" : 0,
        "schedulerDriverStopped" : false,
        "serverSparkVersion" : "2.3.1-SNAPSHOT",
        "success" : true,
        "pendingRetryDrivers" : 0
      }
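
      Because the body is returned even on a 503, a client has to read it on the error path as well. The following is a minimal Python sketch of such a client, not code from the PR: the function names `interpret` and `probe` are hypothetical, and the field names are taken from the sample body above.

```python
import json
import urllib.error
import urllib.request


def interpret(status_code, body):
    """Combine the HTTP status code and the JSON body into one summary dict."""
    payload = json.loads(body)
    return {
        # Per the description: 200 means healthy, 503 means unhealthy.
        "healthy": status_code == 200,
        "scheduler_driver_stopped": payload["schedulerDriverStopped"],
        "queued": payload["queuedDrivers"],
        "launched": payload["launchedDrivers"],
        "pending_retry": payload["pendingRetryDrivers"],
        "message": payload["message"],
    }


def probe(base_url, path="/health", timeout=5.0):
    """Fetch /health or /status and summarise the response."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return interpret(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on a 503, but the response body is still readable.
        return interpret(err.code, err.read().decode("utf-8"))
```

      A call such as `probe("http://dispatcher.example.com:7077", "/status")` would return the summary for one dispatcher; the host and port here are placeholders, not values from this issue.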

      Aside from surfacing all of the scheduler metrics, the response also includes the status of the Mesos SchedulerDriver. We have repeatedly observed the Mesos SchedulerDriver quietly exiting because of some other failure. When this happens, jobs queue up and the only way to recover is to restart the service.

      With the above health check, Marathon can be configured to automatically restart the MesosClusterDispatcher service when the health check fails, reducing the need for manual intervention.
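
      As an illustration of the Marathon side, the sketch below builds a health check definition that points at /health and prints it as JSON for an app definition. It is an assumption-laden example rather than configuration from the PR: the helper name `marathon_health_check` is hypothetical, the timing values are arbitrary, and the field names follow Marathon's health check schema as commonly documented, so they should be verified against the Marathon version in use.

```python
import json


def marathon_health_check(path="/health", port_index=0):
    """Build a Marathon HTTP health check definition as a plain dict."""
    return {
        "protocol": "HTTP",
        "path": path,
        "portIndex": port_index,
        # Illustrative timings only; tune them for the actual deployment.
        "gracePeriodSeconds": 300,
        "intervalSeconds": 60,
        "timeoutSeconds": 20,
        "maxConsecutiveFailures": 3,
    }


# Print the fragment that would sit inside a Marathon app definition.
print(json.dumps({"healthChecks": [marathon_health_check()]}, indent=2))
```

      With a definition along these lines, a 503 from /health counts as a failed check, and Marathon restarts the task once the consecutive failure limit is reached.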


          People

            Assignee: Unassigned
            Reporter: paul mackles (adobe_pmackles)
            Votes: 0
            Watchers: 2
