It can be very tempting to instrument all the things and have a lot of data, but that comes with a burden, both financial and in terms of time spent developing. So what are the prime candidates for observing and generating metrics on?

What constitutes a metric?

A metric is a data point emitted by an application, usually with some additional information attached to it - generally referred to as tags.
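
As a minimal sketch of what that looks like in code, here is emitting a single tagged metric using the DataDog statsd Go client; the agent address, metric name, and tags are made-up assumptions, and any statsd-style client follows the same shape:

```go
package main

import (
	"log"

	"github.com/DataDog/datadog-go/v5/statsd"
)

func main() {
	// Connect to a locally running statsd/DataDog agent (address assumed).
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// One data point: a counter, with tags attached as key:value strings.
	client.Incr("page.views", []string{"path:/home", "env:production"}, 1)
}
```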

Most modern applications are instrumented using APM - application performance monitoring. The APM integration creates traces, which highlight the transactions happening in your application - for example, when a user requests a page. Most APM instrumentation will also integrate with third-party add-ons, such as your database layer, caching layer, etc. One can generate excellent metrics from the automatically generated traces, which can be further manually annotated (for example, with user IDs).
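
As a sketch of such an annotation, assuming DataDog's dd-trace-go (other APM SDKs expose a similar span API; the handler, span name, and header are invented for illustration):

```go
package main

import (
	"net/http"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// handleProfile shows attaching business context (a user ID) to the
// trace generated for this request.
func handleProfile(w http.ResponseWriter, r *http.Request) {
	span, _ := tracer.StartSpanFromContext(r.Context(), "profile.render")
	defer span.Finish()

	// The tag becomes queryable alongside the automatic APM data.
	span.SetTag("user.id", r.Header.Get("X-User-ID"))

	w.Write([]byte("ok"))
}
```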

Why add custom metrics?

The APM instrumentation will generate broad-strokes metrics - it will tell you how many times you query the DB for each page load a user does and how long those queries took. However, it will not give you more application-specific and, in most cases, business-specific metrics. An example of generating custom metrics from the code would be to measure how many times a particular condition is satisfied; an example I recently had was adding a metric for when objects older than 18 months were retrieved from our storage layer.
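
A minimal sketch of that example, again assuming a statsd-style client; the Document type, metric name, and tag are invented for illustration:

```go
package storage

import (
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

// Document stands in for whatever the storage layer returns.
type Document struct {
	CreatedAt time.Time
}

// recordRetrieval counts retrievals of objects older than ~18 months.
func recordRetrieval(client *statsd.Client, doc Document) {
	if time.Since(doc.CreatedAt) > 18*30*24*time.Hour { // roughly 18 months
		client.Incr("storage.old_object.retrieved", []string{"layer:storage"}, 1)
	}
}
```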

Covering such a case from an APM-generated metric would be possible, but it would mean annotating all transactions with the age of the object fetched from the storage layer, which would be a lot of effort; it would also generate a lot of metrics for things we might not care about - therefore ending up as a pretty costly option!

This example does not mean you should create custom metrics for everything, but do consider them when developing your next feature.

I tend to think about metrics in 3 broad categories depending on how they will be used:

  • Metrics used for product decisions/development
  • Metrics used for monitoring and alerting
  • Metrics used for performance tuning

Metrics to aid in product decisions and development

These are the metrics most commonly left out of development plans, because they do not directly shape the current or future state of the code the developer is working on. They impact product decisions instead, and as most developers work in cross-functional squads, I think it is critical for any engineer to also adopt a product owner's mindset for the product they are working on.

When developing a new feature, determining adoption rates, usage rates, and usage patterns is almost always very difficult. This is where, as developers, we can pre-empt some of these questions and think ahead about possible metrics that would help answer them as we roll out the feature/product.

On top of metrics easily derived from the APM (e.g., how many times the resource was requested, how long it took, etc.), we should have product-specific information:

  • How many records did it retrieve?
  • What was the average age of the document?
  • How many different users have accessed the feature?

…and the list can continue and be tweaked depending on the feature you are developing; a sketch of emitting such metrics follows below.
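
Assuming the statsd-style client and Document type from the earlier sketches, with invented metric names (and noting that a user_id tag is only viable if your user base keeps the tag cardinality manageable):

```go
package storage

import (
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

// recordFeatureUsage emits product-focused metrics after a request is served.
func recordFeatureUsage(client *statsd.Client, userID string, docs []Document) {
	// How many records did it retrieve?
	client.Count("feature.records_retrieved", int64(len(docs)), nil, 1)

	// What was the age of each document? Averages are computed dashboard-side.
	for _, d := range docs {
		ageDays := time.Since(d.CreatedAt).Hours() / 24
		client.Histogram("feature.document_age_days", ageDays, nil, 1)
	}

	// How many different users have accessed the feature? Tagging by user
	// lets the metrics backend count distinct users.
	client.Incr("feature.used", []string{"user_id:" + userID}, 1)
}
```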

When thinking of metrics that would fit this category, try to answer the question: what information is needed to inform the next decision in our product development roadmap?

Metrics used for monitoring and alerting

This is a critical set of metrics - the ones most commonly found in any system, as they indicate the overall health of our application.

Standard metrics here would be:

  • CPU/memory usage
  • Error rates
  • The number of requests for each path, broken down by HTTP response code, etc.

However, one can use more subtle metrics for monitoring, and thus alerting, that can and should be product-specific. For example, you might know that you expect an average of three items in the basket for your particular product. If, for some reason, that average starts drifting away from that number, it may be cause for concern.
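
Feeding such a monitor could look like the sketch below, again assuming a statsd-style client and an invented metric name; the monitoring backend would then watch the rolling average and flag drift:

```go
package checkout

import "github.com/DataDog/datadog-go/v5/statsd"

// recordBasketSize emits the basket size at checkout time. A monitor can
// then track the rolling average and flag drift from the expected ~3 items.
func recordBasketSize(client *statsd.Client, itemCount int) {
	client.Histogram("checkout.basket_size", float64(itemCount), nil, 1)
}
```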

When creating metrics that you wish to use for alerting, keep in mind that for alerts to be effective, they need to be actionable. Imagine having to wake up at 3 am and resolve an issue indicated by the metric. If there are no clear instructions, it will be a very tedious and error-prone process, as you are essentially relying on your 3 am judgement to make decisions; so metrics that are used to generate alerts should have obvious levers that can be pulled to resolve issues that arise from unexpected levels.

If you take the previous example - the average basket size metric - a decrease or increase could be driven by a lot of factors (e.g., ongoing promotions for a specific item, so overall smaller baskets, or free shipping over a certain spend amount, resulting in more articles, etc.), which in turn makes it a bad metric to alert on, as it's not something that can be easily fixed. However, it is a great metric to monitor as an indicator of the system's health.

Metrics for monitoring do not need to be used for alerting, so one can mix and match. When approaching alerting, think actionable; when approaching monitoring, consider product-specific top-line metrics.

Metrics used for performance tuning

Arguably, there is an overlap between the metrics used for monitoring and those used for performance tuning. Still, I want to highlight that sometimes we want to add additional instrumentation to particular areas that we think might have a disproportionate impact on the overall performance of our application.

Several metrics are usually provided by the language-specific runtime; for golang, that would be things like the following (a sketch of reading them by hand appears after the list):

  • Number of goroutines
  • How many times the garbage collector (GC) has run
  • How long GC pauses took
  • What the size of the heap is … etc
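
Most metrics libraries collect these automatically, but as a sketch, they can also be read by hand from the standard library's runtime package:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Sample the runtime every 10 seconds; a real application would emit
	// these as metrics rather than print them.
	for range time.Tick(10 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)

		fmt.Println("goroutines:", runtime.NumGoroutine())            // number of goroutines
		fmt.Println("gc runs:", m.NumGC)                              // times the GC has run
		fmt.Println("gc pause total:", time.Duration(m.PauseTotalNs)) // time spent in GC pauses
		fmt.Println("heap bytes:", m.HeapAlloc)                       // current heap size
	}
}
```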

These metrics can alert a developer to issues like memory leaks, copying large amounts of data instead of passing pointers, concurrency problems due to too many goroutines/threads running, etc. These are all valuable metrics to inspect on a semi-regular basis and to keep a close eye on when doing anything like bumping underlying language versions.

Conclusion

To recap, think of metrics in 3 broad categories:

  • Used for furthering product development
  • Used for monitoring and alerting
  • Used for performance tuning the application

Setting aside time to consider what metrics would suit those three categories is vital. It ensures we have the right data to make correct decisions about engineering tweaks/fixes and future features. I suggest identifying at least one metric per category for any new feature you develop.

Something to consider is that there will always be a case of "it would have been helpful to start collecting this data 3 months ago", so my recommendation is to timebox the effort that goes into selecting the metrics you want, since there is a substantial likelihood that once you understand the usage patterns, you will need to tweak your metrics. It is also essential to review your metrics semi-regularly, as you might find that things considered key metrics six months ago are no longer of interest, which gives us a chance to clean up the metric collection and our dashboards.