Monitoring Approaches
Where do you even begin?
You may have noticed that I have a slight obsession with monitoring and logging. Along with solid standards and good practices, they are an integral part of a well-architected system. After all, how else would you know when you are successful?
The main challenge with monitoring is that it is amazingly easy to add at the beginning of a new project, but not so easy to retrofit onto existing projects. So where do you begin?
What is your monitoring goal?
The most important part of planning and implementing a monitoring strategy is understanding your goal. What is it that you are trying to achieve with monitoring? Is it to ensure user satisfaction or to preemptively detect and address performance issues? Is it business intelligence? Is the purpose development, debugging, and failure analysis?
Much like logging, this might seem like common-sense advice; but, in practice, failing to accurately understand your goals results in noisy, hard-to-understand metrics along with resources wasted on analyzing them.
Your goals will affect everything about the way that you monitor, and most definitely what to start with - from what you actually capture to how it is aggregated and used.
Where do you start?
When in doubt, I always start from the outside and work my way in. After all, the applications that we build serve actual purposes in the real world - and they have actual users. Start with what matters to your users.
Think about the fundamental things that you, as a user, would care about. Typically, that means starting with something as basic as how responsive the application is from the user's perspective. Is the application or service even running? How long does it take to load? How long does page navigation take? Are there any pages that require queries, and how "snappy" are they when you are the one sitting at a screen waiting on them to complete their work? In these cases, seconds can seem like minutes.
Of course, if you captured nothing but these basic timings, their utility would be limited. If the times went up, you wouldn't know why based on that knowledge alone. What you would know is that there is potentially an issue. But that is the key point - you now know. You didn't have to rely on or wait for a business user to let you know. You can start analyzing and potentially troubleshooting well before your users may even be aware there is an issue. After all, that is what monitoring is all about.
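To make the idea concrete, here is a minimal sketch of an external "black box" probe using only the Python standard library. The URL, the field names, and the timeout are all illustrative assumptions - the point is simply that we record only what a user would observe: did the application respond, and how long did it take.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> dict:
    """Issue a single GET and record status plus wall-clock latency.

    The application is treated as a black box: no internals are
    inspected, only the externally visible response.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.URLError:
        status = None  # unreachable counts as "not even running"
    elapsed = time.monotonic() - start
    return {"url": url, "status": status, "elapsed_s": round(elapsed, 3)}
```

Run on a schedule and shipped somewhere central, even this trivial measurement answers the two questions above and gives you a baseline to alert against.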
Given that basic timing information on your key applications, you can then use other tools that are easily and commonly available to begin the process of determining why it is slow. The data from these external sources can then be combined to provide a more holistic view of your environment. Essentially, you start by treating your applications as a black box and monitor the things flowing into, out of, and around them that matter from a user's perspective. Only once you have this information in hand should you start the laborious process of adding internal monitoring to your applications - peering into the box.
Think of it this way: does the user really care that a particular function inside your application used to take 4ms and now takes 8ms? Despite the fact that this is a 100% increase in time, we can probably say not. That is not something that they would notice. The same percentage increase for the return of a query result that they are waiting on would be something that is noticed.
I'm not saying that monitoring the internals of your application is not important. It most definitely is, particularly when your developers or team is directly responsible for it in production. Those internal metrics can provide a great deal of insight into the limitations and bottlenecks of your application, and that internal data can be used to identify and help prioritize efforts when enhancing or extending your applications. I have a number of other detailed discussions on what you should think about when addressing internal monitoring and I won't repeat those here.
The key takeaway that I want to impress is to start simple and then work your way in. Most metric-based frameworks and guidance concentrate on the more complicated internal capture process. I hope that they do this because they tacitly assume that this external monitoring is already taking place, not because they consider it unimportant.
The goal of monitoring is to provide immediate benefits and then expand on them with more data as, and only when, needed.
Choose and use a uniform logging framework!
Logging frameworks give developers a way to implement logging in a consistent, standard, and easy-to-configure way. They allow developers to control verbosity, define log levels, configure multiple connections or targets, devise a log rotation policy, and so forth. Almost all will also allow your teams to enable and disable logging as needed.
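As a sketch of what those capabilities look like in practice, here is a small configuration using Python's standard `logging` module. The logger name, file name, and rotation sizes are illustrative assumptions, but the pieces - levels, multiple targets, rotation, and a verbosity switch - are exactly the ones described above.

```python
import logging
import logging.handlers


def build_logger(name: str = "myapp", verbose: bool = False) -> logging.Logger:
    """Configure a logger with a console target and a rotating file target.

    The `verbose` flag demonstrates enabling/disabling detail at runtime.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)

    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    # Target 1: the console, for interactive use.
    console = logging.StreamHandler()
    console.setFormatter(fmt)

    # Target 2: a rotating file - 5 MB per file, 3 backups kept.
    # delay=True defers opening the file until the first record is written.
    rotating = logging.handlers.RotatingFileHandler(
        "myapp.log", maxBytes=5_000_000, backupCount=3, delay=True
    )
    rotating.setFormatter(fmt)

    logger.handlers = [console, rotating]
    return logger
```

A home-grown framework would need to reimplement all of this (and its edge cases); the stdlib and equivalents in other languages already do it well.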
You could, of course, build your own logging framework; over the years I've certainly done that myself. But why do that when there are already tools out there that are easy to use, have community support, and, most importantly, do the job perfectly well? Many of the available logging frameworks have been around for some time, and you would be hard pressed to match their versatility or robustness. They are available in just about every language imaginable, and there is no longer any reason to roll your own.
So one of the most important decisions you and your teams will make is which logging library, or framework, to use. This task can be complicated and time-consuming, as there is a large number of these tools available. Key considerations when selecting one should be ease of use, community, feature-richness, and the impact on your application's performance.
Standardize your approach
The more standardized your approach to logging, the easier it is to find what you need in your logs and analyze them. The logging framework you end up using will help you with this, but there is still plenty of work to be done to ensure your log messages are all constructed the same way.
For starters, be sure developers understand when to use each log level. This will help avoid situations in which different log messages are assigned the same severity, or where critical events go unnoticed because they were assigned the wrong severity.
Second, create a standard for formatting and naming fields. We've come across the same error logged in a totally different way depending on who added the logging message - this should NOT happen. These logging standards cover everything from the format you use for the timestamp field to whether you format your logs as JSON or as key=value pairs.
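One way to enforce such a standard mechanically, rather than by convention, is a shared formatter that every service uses. The following sketch (the field names `ts`, `level`, `logger`, and `msg` are my own illustrative choices) emits each record as one JSON object with a fixed field set and an ISO-8601 UTC timestamp, so the same error always parses the same way regardless of who wrote the log call.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object with a fixed schema."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            # ISO-8601 timestamp in UTC, one agreed format for everyone.
            "ts": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })
```

Attach this formatter to every handler in every service and the "same error, different shape" problem largely disappears, because the schema lives in code instead of in each developer's head.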
In that spirit, I will attempt to set a few standards that I have found useful and should act as a starting point when you are deciding how to implement logging in your application(s).
Additional Resources
There are countless articles from both vendors and thought leaders on logging best practices and which framework you should be using. I prefer to keep it simple and focus only on things that are universal. The answer as to which framework to use and what your exact standards should be is really a matter for your team to decide. After all, they understand their needs better than anyone else. Just make sure that it is uniformly applied.
. . .
Logging is not just writing to a file any longer. Using it effectively is not complicated - it just takes planning.