Spark / SPARK-11373

Add metrics to the History Server and providers

    Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

      The History Server doesn't publish metrics about JVM load or anything from the history provider plugins. This means that performance problems from massive job histories aren't visible to management tools, nor are any provider-generated metrics such as time to load histories, failed history loads, the number of connectivity failures talking to remote services, etc.

      If the history server set up a metrics registry and offered the option to publish its metrics, then management tools could view this data.

      1. The metrics registry would need to be passed down to the instantiated ApplicationHistoryProvider, in order for it to register its metrics.
      2. If the Codahale metrics servlet were registered under a path such as /metrics, the values would be visible as HTML and JSON, without the need for management tools (a rough sketch follows this list).
      3. Integration tests could also retrieve the JSON-formatted data and use it as part of the test suites.
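
      A minimal sketch of item 2, assuming plain Jetty plus the Dropwizard/Codahale metrics-servlets and metrics-jvm artifacts; the port and the wiring are illustrative only, not the History Server's actual code:

      {code:scala}
      import com.codahale.metrics.MetricRegistry
      import com.codahale.metrics.jvm.{GarbageCollectorMetricSet, MemoryUsageGaugeSet}
      import com.codahale.metrics.servlets.MetricsServlet
      import org.eclipse.jetty.server.Server
      import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}

      object MetricsServletSketch {
        def main(args: Array[String]): Unit = {
          val registry = new MetricRegistry()
          // Standard JVM metric sets: garbage collection and memory usage.
          registry.register("jvm.gc", new GarbageCollectorMetricSet())
          registry.register("jvm.memory", new MemoryUsageGaugeSet())

          // Serve the registry contents as JSON under /metrics.
          val handler = new ServletContextHandler()
          handler.setContextPath("/")
          handler.addServlet(new ServletHolder(new MetricsServlet(registry)), "/metrics")

          val server = new Server(18080) // illustrative port
          server.setHandler(handler)
          server.start()
          server.join()
        }
      }
      {code}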

          Activity

          stevel@apache.org Steve Loughran added a comment -
          1. This has tangible benefit for the SPARK-1537 YARN ATS binding, because connectivity failures, GET performance and similar do surface. There are some AtomicLong counters in its YarnHistoryProvider, but I'm not planning to add counters and metrics until after that is checked in.
          2. All providers will benefit from the standard JVM performance counters, GC &c.
          3. The FS history provider could also track the time to list and load histories, the time of the last refresh, the time to load the most recent history, etc.: information needed to identify where an unresponsive UI is getting its problems from.
          charlesyeh Charles Yeh added a comment -

          I could work on this but I need help getting started. I think I need to add specific source types for history provider subtypes. Does this sound about right?
          1. create a HistorySource and add it to a new historyMetricsSystem
          2. add tracking in FsHistoryProvider (i.e. in the checkForLogs function) for time to load histories, failed history loads, the number of connectivity failures talking to remote services, etc.
          3. in HistorySource, register the metrics (see the sketch below)
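
          A rough Scala sketch of that plan, assuming Spark's existing org.apache.spark.metrics.source.Source trait and the Dropwizard/Codahale metric types; the metric names and the lastRefreshTimeMillis accessor are invented for illustration:

          {code:scala}
          package org.apache.spark.deploy.history

          import com.codahale.metrics.{Gauge, MetricRegistry}

          import org.apache.spark.metrics.source.Source

          // Hypothetical source exposing FsHistoryProvider statistics to a metrics system.
          private[spark] class HistorySource(provider: FsHistoryProvider) extends Source {
            override val sourceName: String = "history.fs"
            override val metricRegistry: MetricRegistry = new MetricRegistry()

            // Counters that checkForLogs() would increment as it lists and loads histories.
            val historyLoadCount = metricRegistry.counter("history.load.count")
            val historyLoadFailures = metricRegistry.counter("history.load.failures")

            // Gauge reading a timestamp the provider would have to maintain (invented accessor).
            metricRegistry.register("history.last.refresh.time", new Gauge[Long] {
              override def getValue: Long = provider.lastRefreshTimeMillis()
            })
          }
          {code}

          The source could then be registered with whatever metrics system the history server ends up owning.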

          stevel@apache.org Steve Loughran added a comment -

          I have in my head roughly how to do this; in SPARK-1537 I've got more complex metrics being collected.

          I'd have the providers themselves register their metrics; they'd just be given the registry and told to do it. I'd do this by adding a new method to the base class, start(BindingInfo), where BindingInfo would be a class with currently just one entry, "metrics registry". (I'd do it that way so that we could add more binding info without breaking plugins in future.)
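
          A minimal sketch of that binding, under the assumption it lands on ApplicationHistoryProvider; both BindingInfo and start() are the proposal in this comment, not existing Spark API:

          {code:scala}
          package org.apache.spark.deploy.history

          import com.codahale.metrics.MetricRegistry

          // Proposed carrier for what the history server hands its provider at startup.
          // It holds only the metrics registry for now, so more fields can be added later
          // without breaking existing provider plugins.
          private[spark] case class BindingInfo(metricRegistry: MetricRegistry)

          // Proposed hook on the provider base class: a no-op by default, overridden by
          // providers that want to register metrics (and start their background threads).
          private[spark] abstract class ApplicationHistoryProvider {
            def start(binding: BindingInfo): Unit = {}

            // ... existing provider methods elided ...
          }
          {code}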

          In FsHistoryProvider.start(BindingInfo), I'd move all the thread-starting code out of the constructor. Starting threads there is trouble, especially for subclassing (and yes, mock tests). It could also add some new metric values.

          For the YarnHistoryProvider, I've already got some counters; they're just atomic longs in the class. In the publisher code, I've factored out these counters, switched them to Codahale Counter classes, and then registered them.
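
          A before/after sketch of that switch, assuming the Dropwizard/Codahale Counter class; the field and metric names are made up:

          {code:scala}
          import java.util.concurrent.atomic.AtomicLong

          import com.codahale.metrics.{Counter, MetricRegistry}

          // Before: a plain field, visible only to code that reads it in-process.
          class BeforeProvider {
            val refreshFailures = new AtomicLong(0)
            def onRefreshFailure(): Unit = refreshFailures.incrementAndGet()
          }

          // After: the same count kept in a registered Codahale Counter, so any configured
          // reporter or servlet can publish it.
          class AfterProvider(registry: MetricRegistry) {
            val refreshFailures: Counter = registry.counter("yarn.history.refresh.failures")
            def onRefreshFailure(): Unit = refreshFailures.inc()
          }
          {code}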

          That's what I'd do in the providers: let them make up their own metrics and register them.

          Now, the next fun issue is: how to publish this? That is: how to read in the config and have the server hook up its metrics? I'd actually like the default to just be the Codahale metrics servlets, as I've found these great for functional "metrics first" tests: you manipulate the system and verify that the metrics notice. The web servlets are trivial. Supporting hooking up to Ganglia, Graphite, systemd, etc.: I have no idea where to begin.

          Anyway, if you want to work on this, I'll try to help. I'll certainly help with the binding to the providers, and show you how to bind the Codahale servlets. I'll leave it to you to work out how to do the broader metrics bindings.

          apachespark Apache Spark added a comment -

          User 'steveloughran' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9571

          stevel@apache.org Steve Loughran added a comment -

          Charles Yeh I've just put up a pull request of what I had in mind, with those basic fs metrics, and JVM & thread info.

          I couldn't hook this up to the Spark metrics system as there wasn't one that could be used ... for now I've just gone directly to the Codahale servlets and classes for registration.

          Your suggestion of a new history metrics system would be the right thing to do ... but I would really like those metrics to be fetchable as bits of JSON at the end of URLs: that's both enumerating the whole set and reading specific values. Why?

          1. lets me ask anyone with a web browser to hand for performance stats: you can say "do a curl history:1800/metrics/metrics > metrics.json" and I've got something I can attach to bug reports.
          2. lets me write tests which query the metrics for the state of the provider, e.g. probe a counter of seconds-since-successful-update to be between 0 and 60 before trying to list the applications and expecting them to be found. Or, after mocking a connectivity failure, verify that the failure counts have gone up (a small test sketch follows this list).
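
          A small sketch of that kind of "metrics first" probe, assuming ScalaTest and a servlet publishing Codahale JSON; the URL, metric name, and JSON shape are assumptions for illustration:

          {code:scala}
          import scala.io.Source

          import org.scalatest.FunSuite

          class HistoryMetricsSuite extends FunSuite {

            // Crude extraction for the sketch; a real test would use a JSON parser.
            private def counterValue(json: String, name: String): Option[Long] = {
              val pattern = ("\"" + name + "\"\\s*:\\s*\\{\\s*\"count\"\\s*:\\s*(\\d+)").r
              pattern.findFirstMatchIn(json).map(_.group(1).toLong)
            }

            test("no history load failures after a clean refresh") {
              val json = Source.fromURL("http://localhost:18080/metrics/json").mkString
              assert(counterValue(json, "history.load.failures").contains(0L))
            }
          }
          {code}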

          Anyway: the draft is up; I won't be working on it again for the next couple of weeks. If, after reviewing my patch, you could take it and do a real Spark history metrics system, that'd really progress it. And again, that's where the servlets would help: testing the metrics system itself.

          apachespark Apache Spark added a comment -

          User 'steveloughran' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17747

          stevel@apache.org Steve Loughran added a comment -

          Metrics might help with understanding the S3 load issues in SPARK-19111.

          ndimiduk Nick Dimiduk added a comment -

          I'm chasing a goose through the wild and have found my way here. It seems Spark has two independent subsystems for recording runtime information: history/SparkListener and Metrics. I'm startled to find a whole wealth of information exposed during job runtime over HTTP/JSON via api/v1/applications, yet none of this is available to the Metrics systems configured with the metrics.properties file. Lovely details like the number of input, output, and shuffle records per task are unavailable to my Grafana dashboards fed by the Ganglia reporter.

          Is it an objective of this ticket to report such information through Metrics? Is there a separate ticket tracking such an effort? Is it a "simple" matter of implementing a SparkListener that bridges to Metrics?
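
          For what it's worth, a hypothetical sketch of such a bridge, assuming Spark 2.x SparkListener and TaskMetrics accessors and a Codahale registry; where the registry then gets reported is left out:

          {code:scala}
          import com.codahale.metrics.MetricRegistry
          import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

          // Hypothetical bridge copying a few task-level figures into a Codahale registry,
          // so that a configured reporter (e.g. Ganglia) could ship them onward.
          class TaskMetricsBridge(registry: MetricRegistry) extends SparkListener {
            private val inputRecords = registry.counter("tasks.input.records")
            private val outputRecords = registry.counter("tasks.output.records")
            private val shuffleReadRecords = registry.counter("tasks.shuffle.read.records")

            override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
              val m = taskEnd.taskMetrics
              if (m != null) {
                inputRecords.inc(m.inputMetrics.recordsRead)
                outputRecords.inc(m.outputMetrics.recordsWritten)
                shuffleReadRecords.inc(m.shuffleReadMetrics.recordsRead)
              }
            }
          }
          {code}

          The listener could be attached via spark.extraListeners or SparkContext.addSparkListener; how its registry then reaches the metrics.properties-configured sinks is exactly the open question above.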


            People

            • Assignee: Unassigned
            • Reporter: stevel@apache.org Steve Loughran
            • Votes: 0
            • Watchers: 9
