  1. Nutch
  2. NUTCH-2909

Establish a metrics naming convention

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18
    • Fix Version/s: 1.21
    • Component/s: metrics
    • Labels: None

    Description

      I revisited Nutch metrics counters and put some metrics documentation together for others to consult should they wish.

      I thought a comprehensive collection of all Nutch Counters would be useful, so I put together a metrics table. One (unintended) outcome was that it highlighted the variability in counter group names and metric names. For example:

      Metric Group:

      • CleaningJobStatus - upper camel case
      • CrawlDB filter - inconsistent use of capitalization and space separated
      • N/A - the DomainStatistics counters don't belong to a metric group
      • injector - lowercase named after the encapsulating Class
      • WebGraph.outlinks - inconsistent use of capitalization and period separated

      The Metric Names are much the same: pretty much all over the place.

      I am keen to bring some convention to the Nutch metrics definitions, but this is not all plain sailing. I do understand that existing users may rely upon the above metrics as they are, and that changing the values would have impacts downstream.

      PROPOSAL
      I would like to discuss introducing a naming convention that follows some simple principles, motivated by a Datadog employee's response on SO.

      As a take on that post, I want to propose the following:

      1. With regard to the Metric Group, the highest level of the hierarchy is the product line or process, i.e., nutch. This level is always lowercase.
      2. The next level of the hierarchy is the sub-component or tool, e.g., nutch.Injector, nutch.Generator, nutch.ParseSegment, nutch.SitemapProcessor. This constituent is spelled exactly as the enclosing Class name, which makes it simple to trace a metric back to the Class in which it was defined.
      3. The third level of the hierarchy is the metric group, a general grouping of functionality for the metric being defined, e.g., nutch.QueueFeeder.fetcher_status. This constituent is lowercase with words separated by underscores. If no obvious metric group exists, simply use the enclosing Class name in lowercase, e.g., nutch.Injector.injector.urls_filtered.
      4. With regard to the Metric Name, the last level of the hierarchy is the thing being measured, e.g., urls_filtered, above_exception_threshold_in_queue. Everything is lowercase with words separated by underscores, the same as #3 above.
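
      The four rules above could be captured in a small helper so that names are assembled the same way everywhere. The following is only a sketch: the class MetricNames and its methods are hypothetical and do not exist in Nutch, and the fallback for a missing group converts the Class name to lowercase with underscores (for example QueueFeeder becomes queue_feeder), which is one possible reading of rule 3.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Nutch) that assembles names per the proposed rules. */
final class MetricNames {
    // Rule 1: the root of the hierarchy is the lowercase product name.
    private static final String ROOT = "nutch";

    private MetricNames() {}

    /** Rules 3 and 4: lowercase, words separated by underscores. */
    static String toSnakeCase(String text) {
        return text.trim()
                .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // split camelCase boundaries
                .replaceAll("[^A-Za-z0-9]+", "_")          // spaces, periods, dashes become underscores
                .replaceAll("^_+|_+$", "")                 // drop leading/trailing underscores
                .toLowerCase(Locale.ROOT);
    }

    /**
     * Rule 2: className is used verbatim. If group is null or empty,
     * the class name itself (converted to snake case) is used as the group.
     */
    static String build(String className, String group, String metric) {
        String effectiveGroup = (group == null || group.trim().isEmpty()) ? className : group;
        return ROOT + "." + className + "." + toSnakeCase(effectiveGroup) + "." + toSnakeCase(metric);
    }
}
```

      For instance, MetricNames.build("Injector", null, "urls filtered") would yield nutch.Injector.injector.urls_filtered under these assumptions.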

      Example complete metrics

      • nutch.Injector.injector.urls_filtered
      • nutch.ResolverThread.update_host_db.checked_hosts
      • nutch.WebGraph.outlinks.added_links
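
      A complete name in the proposed format can also be checked mechanically, for example in a unit test that walks all declared counters. The sketch below is an assumption about how such a check might look: MetricNameValidator is a hypothetical class, and the pattern assumes Class names start with an uppercase letter and contain only letters and digits.

```java
import java.util.regex.Pattern;

/** Hypothetical validator (not part of Nutch) for nutch.<Class>.<group>.<metric> names. */
final class MetricNameValidator {
    // Lowercase words separated by single underscores, e.g. urls_filtered.
    private static final String SNAKE = "[a-z0-9]+(?:_[a-z0-9]+)*";

    // Lowercase root, Class name as written in code, then group and metric in snake case.
    private static final Pattern FULL_NAME = Pattern.compile(
            "nutch\\.[A-Z][A-Za-z0-9]*\\." + SNAKE + "\\." + SNAKE);

    private MetricNameValidator() {}

    static boolean isValid(String name) {
        return name != null && FULL_NAME.matcher(name).matches();
    }
}
```

      With this pattern, nutch.WebGraph.outlinks.added_links is accepted, while a name containing a space, such as "nutch.WebGraph.outlinks.added links", or an existing group name such as "CrawlDB filter" is rejected.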

      It would be greatly appreciated if folks could chime in on the details of the proposal. I'm sure there are several areas which could be improved.

      I will mention that my specific driver for cleaning this up is that I would like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler subsystem will be integrated with all the rest of the subsystems I am responsible for. We use Splunk for that kind of thing. I intend to do that by implementing the Java statsd client but I feel that comes after we clean up metrics and establish a metrics naming convention.
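
      To illustrate why dot-separated, whitespace-free names matter for this use case, here is a rough sketch of emitting a counter in the StatsD plain-text format (name:value|c) over UDP using only the JDK. It is not the Java statsd client mentioned above; StatsdCounterSketch, the host, and the port are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch (not part of Nutch): sends one StatsD counter increment over UDP. */
final class StatsdCounterSketch {
    private StatsdCounterSketch() {}

    /** Formats a counter increment in the StatsD plain-text form: name:value|c */
    static String counterLine(String metricName, long delta) {
        return metricName + ":" + delta + "|c";
    }

    /** Fire-and-forget send; host and port depend on where the StatsD agent listens. */
    static void send(String host, int port, String line) throws IOException {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        }
    }
}
```

      A call such as StatsdCounterSketch.send("localhost", 8125, StatsdCounterSketch.counterLine("nutch.Injector.injector.urls_filtered", 1)) would then emit one increment, assuming an agent listens on that address.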

      Thanks for any input.

          People

            lewismc Lewis John McGibbney