Logging

What and how should you be logging?


The importance of adding and managing logging in any application cannot be overstated. If used correctly, logs can contain a wealth of information about a specific event or failure. Logs can tell us not only when an event took place but also why it occurred.

However, to ensure that application logs are actually useful, there are some basic requirements. First and foremost, they need to be digestible: structured in a way that both humans and machines can read. Meeting this requirement takes careful planning and strategizing, but it also ensures that searching and analyzing the logs will be easier and can largely be automated.

Here, I will be discussing some tips and best practices I share with development teams when they set up logging for their applications.

What is your logging goal?

The most important part of planning a logging strategy is understanding your goal. What is it that you are trying to achieve with logging? Is it business intelligence? Is the purpose development, debugging, and failure analysis?

This might seem like common-sense advice, but in practice, failing to accurately understand your goals results in noisy, hard-to-read logs and resources wasted on analyzing them.

Your goals will affect literally everything about the way that you log - from what you actually capture to how it is formatted and analyzed.

What should you log?

In my role, and through the years, I've seen just about every logging approach you can imagine. At some companies, I've seen teams take the approach of logging EVERYTHING under the assumption that it will all be valuable at some point. The fact is that, in many cases, I've seen this "log as if there is no tomorrow" approach result in: a) logs that are so noisy that it is just about impossible to find anything, b) logging that is so pervasive that it actually impacts the application being logged, and c) needlessly expensive logging costs for storage and retention.

Once you’ve understood what you’re aiming to achieve with your logging, you can begin devising a strategy that defines what you want to log, and perhaps even more importantly — what you don’t want to log.

Choose and use a uniform logging framework!

Logging frameworks give developers a way to implement logging in a consistent, standard, and easy-to-configure way. They allow developers to control verbosity, define log levels, configure multiple destinations or targets, devise a log rotation policy, and so forth. Almost all will also allow your teams to enable and disable logging as needed.

You could, of course, build your own logging framework; over the years I've certainly done that myself. But why do that if there are already tools out there that are easy to use, have community support, and, most importantly, do the job perfectly well? Many of the available logging frameworks have been around for some time, and you would be hard pressed to match their versatility or robustness. They are available in just about every language imaginable, and there is no longer any reason to roll your own.

So one of the most important decisions you and your teams will make is which logging library, or framework, to use. This task can be complicated and pretty time-consuming, as there are so many of these tools available. Key considerations when selecting one should be ease of use, community, feature richness, and the impact on your application's performance.
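
To make this concrete, here is a minimal sketch of what using an established framework looks like, assuming SLF4J as the logging facade (the class, method, and message are purely illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {

    // One logger per class, obtained from the framework rather than built by hand
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String orderId) {
        // Verbosity, destinations, and formatting are controlled by configuration,
        // not by the code that emits the message
        log.info("Order '{}' received", orderId);
    }
}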

Standardize your approach

The more standardized your approach to logging, the easier it is to find what you need and to analyze your logs. The logging framework you end up using will help you with this, but there is still plenty of work to be done to ensure your log messages are all constructed the same way.

For starters, be sure developers understand when to use each log level. This will help avoid situations in which different log messages are assigned the same severity or in which critical events go unnoticed because the wrong severity was assigned to them.

Second, create a standard for formatting and naming fields. We've come across the same error logged in a totally different way depending on who added the logging message; this should NOT happen. These logging standards cover everything from the format you want to use for the timestamp field to whether you will format your logs as JSON or as key=value pairs.

In that spirit, I will attempt to set out a few standards that I have found useful and that should act as a starting point when you are deciding how to implement logging in your application(s).

1. Create entries that humans can actually read

Avoid using complex encoding or layouts that make even the most basic information unintelligible. Developers need to be able to read logs for the details of events, while at the same time we want those logs to be processed by whatever monitoring system we may be using. Developer-friendly formats like JavaScript Object Notation (JSON) are readable by both humans and machines.
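
As an illustration (the field names and values below are invented), a JSON-formatted entry might look something like this; it is easy for a person to scan and trivial for a machine to parse:

{"timestamp": "2021-06-14T09:15:32.417Z", "level": "ERROR", "logger": "PaymentService", "message": "Payment declined", "orderId": "A-8713", "reason": "insufficient_funds"}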

2. Use clear and unambiguous key-value pairs

One of the most powerful features of proper logging is the ability to extract fields and create structured data that can be filtered and pivoted against. Use the template strategy for logging (a fixed message with substitution parameters) whenever possible.
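
As a small sketch (assuming SLF4J and invented field names), a template-style statement keeps the field names stable from one entry to the next while only the values change, which is exactly what downstream filtering and pivoting need:

// The keys stay constant; only the values vary between entries
log.info("payment processed orderId={} userId={} durationMs={}", orderId, userId, durationMs);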

3. Start small

It is much better to start with a small number of logs and then add to them over time than to start with a large collection and try to winnow it down. Let teams get used to using their logs, and let them add more where and when they see value. Avoid the noise problem.

4. Be concise and descriptive

Each logging statement should contain both data and a description. Compare the following:

log.debug("Message processed");
log.debug(message.getMessageID());
        
log.debug("Message with id '{}' processed", message.getMessageID());

You will notice that the final logging statement contains far more information in a single line. It is much easier for a human to read and understand what is happening. It is also more useful because, depending on where your logs are written, the two separate statements may well end up dozens or hundreds of lines apart.

5. Use timestamps on every log statement

The correct time is critical to understanding the proper sequence of events. Your selected framework should automatically timestamp logging events.

6. Use unique identifiers

Unique identifiers such as transaction IDs and user IDs are tremendously helpful when debugging, and even more helpful when you are gathering analytics. Unique IDs can point you to the exact transaction. Without them, you might only have a time range to use. When possible, carry these IDs through multiple touch points and avoid changing the format of these IDs between modules. That way, you can track transactions through the system and follow them across machines, networks, and services.

It is also a good idea to use a correlation ID when you have a need or desire to correlate responses from other systems.
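
One way to carry such an identifier through every statement a request produces is a mapped diagnostic context. The sketch below assumes SLF4J's MDC, an incoming request object, and an invented header name:

import org.slf4j.MDC;

// Attach the correlation ID once; the framework's output pattern can then
// include it in every entry written while this request is handled
MDC.put("correlationId", request.getHeader("X-Correlation-ID"));
try {
    log.info("Message with id '{}' received", message.getMessageID());
    process(message);
} finally {
    // Clean up so the ID does not leak into unrelated log entries
    MDC.remove("correlationId");
}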

7. Log more than DEBUG statements

Put semantic meaning in events to get more out of the data. Log audit trails, what users are doing, transactions, timing information, and so on. Log anything that can add value when aggregated, charted, or further analyzed. In other words, log anything that is interesting to the business.
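
For example, an audit-style event logged at INFO with timing information can later be aggregated or charted; the event name and fields below are invented for illustration:

long start = System.currentTimeMillis();
boolean approved = approveLoan(application);
// A business-level event rather than a debugging statement:
// useful for dashboards, audit trails, and analytics
log.info("loan_decision applicationId={} approved={} durationMs={}",
        application.getId(), approved, System.currentTimeMillis() - start);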

8. Identify sources and destinations

Include the source and, if appropriate, the target of the log event. This information should include the system, sub-system, application domain, type, and any context information available, such as a transaction ID. The goal here is to provide useful context in which a logging event occurred. Why do this? It will help you determine what went wrong, how badly it went wrong, how the application recovered (especially at the WARN level), and the possible causes of the error.
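
As a small, hedged sketch of what that context can look like in practice (the sub-system and field names are invented), the static source information and the dynamic transaction details can simply be part of the structured message:

// Static context (system, sub-system) plus dynamic context (transaction ID, attempt count)
log.warn("Timeout calling ERP during invoice sync, retrying. system=billing subsystem=invoice-sync transactionId={} attempt={} maxAttempts={}",
        txId, attempt, maxAttempts);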

9. Use Logging Levels

Categorize any logging events. For example, use severity values like DEBUG, INFO, WARN, ERROR, and FATAL. This is part of almost any logging framework and will automatically be included. There is more information on logging levels below.

10. Avoid string concatenation

String concatenation is expensive and resource-consuming; use substitution parameters instead.
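
With substitution parameters, the final message is only assembled if the entry is actually written, whereas the concatenated version pays the cost of building the string even when the level is disabled. A quick sketch, again assuming SLF4J:

// Expensive: the string is built even if DEBUG logging is switched off
log.debug("Message with id '" + message.getMessageID() + "' processed");

// Cheaper: the placeholder is only filled in if the statement is actually emitted
log.debug("Message with id '{}' processed", message.getMessageID());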

11. Watch out for external systems

If you communicate with an external system, consider logging every piece of data that leaves your application and every piece that comes in. Period.
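
Here is a sketch of what this can look like around an outbound HTTP call, using Java's built-in HttpClient (the URL and logged fields are illustrative):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder(URI.create("https://payments.example.com/charge"))
        .GET()
        .build();

// Log what leaves the application...
log.info("Calling payment service uri={}", request.uri());

HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

// ...and what comes back
log.info("Payment service responded status={} bodyLength={}",
        response.statusCode(), response.body().length());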

12. Log exceptions

Logging exceptions is one of the most important roles of logging. At a minimum, every try/catch block should include a logging statement. Don't swallow exceptions unless you are 100% certain that you don't need to handle them (this is rare).
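
At a minimum, that means passing the exception itself to the logger so the stack trace is preserved. A small sketch assuming SLF4J (the rethrow strategy is just illustrative):

try {
    process(message);
} catch (Exception e) {
    // Passing the exception as the last argument records the full stack trace
    log.error("Failed to process message with id '{}'", message.getMessageID(), e);
    // Handle or rethrow; don't silently swallow it
    throw e;
}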

13. Do NOT log sensitive information

This includes obvious things such as passwords, but it also includes business information that the log user doesn’t need to know.

14. Logging in production should only be set to WARN or above

You want to have exception/problem information available for first-pass problem determination in a production-level enterprise application without turning on debug as the default log level. There is simply too much information in debug to be appropriate for day-to-day operations.

15. Don't try and optimize until you need to

Premature optimization of logging performance is typically wasteful. Don't completely ignore performance while logging, but in general, logging performance is good and will not substantially affect your code if you are using it responsibly.
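
When a specific argument really is expensive to compute, most frameworks offer a cheap guard you can reach for at that point rather than up front; a sketch assuming SLF4J (the dump method is invented):

// Only worth adding once profiling shows the argument is genuinely expensive to build
if (log.isDebugEnabled()) {
    log.debug("Full cart state: {}", cart.dumpContentsAsString());
}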

Logging Levels

Categorize any and all logging events appropriately. For example, use severity values like VERBOSE, DEBUG, INFO, WARNING, ERROR, and FATAL. The selected level is typically part of any logging framework and will automatically be included. The typical logging levels from highest to lowest are:

FATAL / CRITICAL

Severe errors that cause premature termination and critical errors causing complete failure of the application. Expect these to be immediately visible on any status console because they need to be dealt with immediately.

Considerations: This is an overall application or system failure that should be investigated immediately. Yes, wake up the admin. Since we prefer our admins alert and well-rested, this severity should be used very infrequently. If it is happening daily and that is not a very big deal, it has lost its meaning. Typically, a fatal error only occurs once in the process lifetime, so if the log is tied to the process, this is typically the last message in the log.

ERROR

Something has gone terribly wrong and must be investigated immediately. This level indicates a failure within the application or a connected system. No system can tolerate items logged at this level. Examples: a NullPointerException, the database is unavailable, a mission-critical use case cannot be continued. Expect these to be immediately visible on any status console.

Considerations: This is definitely a problem that should be investigated. The admin should be notified automatically, but doesn't need to be dragged out of bed. By filtering a log to look at errors and above, you get an overview of error frequency and can quickly identify the initiating failure that might have resulted in a cascade of additional errors. Tracking error rates versus application usage can yield useful quality metrics such as MTBF (Mean Time Between Failures), which can be used to assess overall quality. For example, this metric might help inform decisions about whether or not another beta testing cycle is needed before a release.

WARNING

The process can continue, but take extra caution; warnings are typically used as indicators of possible issues or of service/functionality degradation. This level of logging comprises two distinct types of information: one for obvious problems where a workaround exists (for example: "Current data unavailable, using cached values") and one for potential problems and suggestions. The application can tolerate warning messages, but they should always be justified and examined. Examples might include: use of deprecated APIs, poor use of an API, 'almost' errors, other run-time situations that are undesirable or unexpected but not necessarily "wrong", "Application running in development mode", or "Administration console is not secured with a password". Expect these to be immediately visible on any status console.

Considerations: This MIGHT be a problem, or it might not. For example, expected transient environmental conditions, such as a short loss of network or database connectivity, should be logged as warnings, not errors. Viewing a log filtered to show only warnings and errors may give quick insight into early hints at the root cause of a subsequent error. Warnings should be used sparingly so that they don't become meaningless. For example, loss of network access should be a warning, or even an error, in a server application, but might be just an info message in a desktop app designed for occasionally disconnected laptop users.

INFO

Events of interest, or events that have relevance to outside observers; the default enabled minimum logging level. Examples include: an important business process has started or finished. In an ideal world, an administrator or advanced user should be able to understand INFO messages and quickly find out what the application is doing. Another definition of an INFO message: any action that changes the state of the application significantly (a database update, an external system request). Expect that it is likely for these to be visible on any status console.

Considerations: This is important information that should be logged under normal conditions, such as successful initialization, services starting and stopping, or successful completion of significant transactions. Viewing a log showing INFO and above should give a quick overview of major state changes in the process, providing top-level context for understanding any warnings or errors that also occur. Don't have too many INFO messages; we typically have fewer than 5% INFO messages relative to VERBOSE.

DEBUG

Internal control flow and diagnostic state dumps to facilitate pinpointing of recognized problems. This level should not normally be enabled in a production environment unless you are attempting to troubleshoot a problem.

Considerations: This level and VERBOSE are used almost exclusively by developers to troubleshoot specific problems. The information here describes, at a detailed level, the logical operations occurring within the code and can be used by a developer to determine the flow of a process through the system.

VERBOSE

Tracing information and debugging minutiae; generally only switched on in unusual situations. Examples include performance level logging and loop counters.

Considerations: This is the most detailed level of logging and will only be used by developers with a deep understanding of the code to troubleshoot intricate problems. This is typically the severity level most commonly used for things like performance metrics and for logging specific details about the internal state of functions in the environment.

When choosing an appropriate log level for a particular logging entry, it is best to consider how the information will be used by the consumer of the log. The audience matters! Administrative users will typically only see logging levels at INFO and higher. The higher the level, the more importance it should take in their minds, and the more specific and actionable the messages should be. While developers will be able to see all logging levels, the information found in the DEBUG and certainly the VERBOSE log levels is mostly of use to them and should be targeted appropriately.

Additional Resources

There are countless articles from both vendors and thought leaders on logging best practices and which framework you should be using. I prefer to keep it simple and focus only on things that are universal. The answer as to which framework to use and what your exact standards should be is really a matter for your team to decide. After all, they understand their needs better than anyone else. Just make sure that it is uniformly applied.

. . .

Logging is not just writing to a file any longer. Using it effectively is not complicated; it just takes planning.