Another goal is to keep the metrics package usable stand-alone,
independent of Hadoop. Therefore, if Hadoop currently needs only a
subset of the functionality, it would make sense to create any
encapsulation of that subset somewhere outside of the metrics package,
unless it is clear that a large percentage of non-Hadoop users will want
that particular subset. But let us put this API issue aside for now
and focus on the higher-priority issue, which is getting useful metric
data out of Hadoop. For that, we need to use counters, and counters
won't work with the Metrics.report method.

We have a few classes that each wrap a MetricsRecord: MapTaskMetrics,
TaskTrackerMetrics, etc. Some use Metrics.report and others use the
MetricsRecord API directly. The classes that use the MetricsRecord API
directly should be OK. For the rest, we should get rid of the report
method (there are about two dozen calls) and use the MetricsRecord API
instead.

As an alternative to using a callback, we
could just split the MetricsRecords into smaller ones, where each
corresponds to an "event". E.g. we currently have a MetricsRecord
called "map" containing metrics input_records, input_bytes,
output_records and output_bytes. This could be split into two, one for
input data and one for output data. This makes sense because the
input-related and output-related metrics are updated at different times.
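
To make the splitting idea concrete, here is a minimal sketch. It does
not use the real MetricsRecord class: SketchRecord is a hypothetical
stand-in that only models "increment some metrics, then update", and the
record names map_input and map_output and the class SplitMapMetrics are
assumptions for illustration. Only the four metric names come from the
existing "map" record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metrics record. It is NOT the real
// MetricsRecord class; it only models "increment, then update".
class SketchRecord {
    private final String name;
    private final Map<String, Long> pending = new LinkedHashMap<>();
    private final Map<String, Long> published = new LinkedHashMap<>();

    SketchRecord(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    // Accumulate an increment locally; nothing is visible until update().
    void incrMetric(String metric, long delta) {
        pending.merge(metric, delta, Long::sum);
    }

    // Publish every pending increment of this record as one unit.
    void update() {
        pending.forEach((metric, delta) -> published.merge(metric, delta, Long::sum));
        pending.clear();
    }

    long publishedValue(String metric) {
        return published.getOrDefault(metric, 0L);
    }
}

// The single "map" record split into one record per event.
class SplitMapMetrics {
    final SketchRecord input = new SketchRecord("map_input");    // assumed name
    final SketchRecord output = new SketchRecord("map_output");  // assumed name

    // Called when input is read; touches only the input record.
    void recordInput(long records, long bytes) {
        input.incrMetric("input_records", records);
        input.incrMetric("input_bytes", bytes);
        input.update();
    }

    // Called when output is written; touches only the output record.
    void recordOutput(long records, long bytes) {
        output.incrMetric("output_records", records);
        output.incrMetric("output_bytes", bytes);
        output.update();
    }
}
```

The point of the split is that each update publishes only the metrics
that actually changed in that event, so an input event never has to
carry stale or zero values for the output metrics.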
This splitting approach would work for everything, but it would be a
bit inelegant in a case such as JobTrackerMetrics, where there are 6
counters that need to be independently incremented. There, we could
shorten the code by using a callback.
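
For comparison, here is a rough sketch of what the callback variant
could look like for a class with several independent counters. All the
types here (SketchUpdater, SketchContext, JobTrackerCounters) are
hypothetical and are not the real metrics package API, and the six
counter names are only illustrative guesses. The assumption is that the
metrics system periodically invokes a registered callback, which then
publishes whatever has accumulated since the last call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical callback interface invoked by the metrics system.
interface SketchUpdater {
    void doUpdates();
}

// Hypothetical stand-in for the metrics system itself.
class SketchContext {
    private final List<SketchUpdater> updaters = new ArrayList<>();
    final Map<String, Long> published = new LinkedHashMap<>();

    void registerUpdater(SketchUpdater updater) {
        updaters.add(updater);
    }

    void publish(String metric, long delta) {
        published.merge(metric, delta, Long::sum);
    }

    // A real system would call this from a timer; here it is manual.
    void tick() {
        for (SketchUpdater updater : updaters) {
            updater.doUpdates();
        }
    }
}

class JobTrackerCounters implements SketchUpdater {
    // Illustrative counter names, not taken from the existing code.
    private static final String[] NAMES = {
        "maps_launched", "maps_completed",
        "reduces_launched", "reduces_completed",
        "jobs_submitted", "jobs_completed"
    };

    private final SketchContext context;
    private final Map<String, AtomicLong> counters = new LinkedHashMap<>();

    JobTrackerCounters(SketchContext context) {
        this.context = context;
        for (String name : NAMES) {
            counters.put(name, new AtomicLong());
        }
        context.registerUpdater(this);
    }

    // Each counter is incremented independently; nothing is published here.
    void incr(String name) {
        counters.get(name).incrementAndGet();
    }

    // The callback: publish and reset all six counters in one place.
    @Override
    public void doUpdates() {
        counters.forEach((name, count) -> context.publish(name, count.getAndSet(0)));
    }
}
```

With this shape, the code that increments a counter is a one-liner, and
the publishing logic lives in a single method rather than being repeated
for each of the six counters.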
Should I submit a patch with these changes?