In the IT industry, logs provide invaluable insights into system behavior, performance, and security, enabling timely troubleshooting and data-driven decision making. As a result, generating a vast quantity of logs is often treated as a goal in itself. However, indiscriminately logging every single step of your code can lead to chaos in log storage, failing to deliver the expected benefits of log collection. In this article, we’ll look at best practices for log generation, collection, and analysis to help you get the most from your logs.
Logs are short messages that capture significant events within a software system, along with associated metadata. Log collection refers to the generation, aggregation, and storage of the historical data represented by the logs.
Typically, log messages are generated within a software’s source code or by infrastructure components. These messages are either stored locally on disk or sent to a dedicated server; in both locations, the log entries are processed, stored, and analyzed.
The main use cases for log collection include:
- Troubleshooting bugs: Log messages can help to reconstruct the sequence of events leading to a bug and provide useful data that gives context.
- Detecting errors: You may be unaware of a specific bug or failure until an anomaly appears in your log files. Monitoring logs helps detect errors and system malfunctions.
- Investigating security incidents: Unauthorized access attempts, cyberattacks, and other suspicious activity can be revealed by logs, which track unusual events happening in the system.
- Usage analytics: Logs can mark various milestones or steps in users’ interaction with your services and applications, enhancing your understanding of how they use your software.
Server applications often use logs to analyze how their API is used, monitor outages, and measure latency when exchanging data between subsystems, or the system and the user, in order to recognize performance bottlenecks. In mobile apps, it’s common to use logs while investigating crash reports and analyzing A/B testing results.
The entire process of logging can be divided into two categories of activities: log collection and log analysis. The first group encompasses everything that produces log messages, including generating logs and saving them to a file or sending them to remote storage. The second group covers activities on the log consumer’s side—log storage, processing, combining, filtering, and, ultimately, analysis. In some cases, the boundary between the two groups is blurred, so some practices recommended in this article affect the entire process, not just the group they logically relate to.
In order to get the most from your logs, it’s important to follow certain best practices. Here are some simple practices to ensure that your logs are useful tools for the efficient maintenance of your app, instead of a pile of unsorted data with no practical value.
Good results start with a good plan, and logs are no exception. Although logs can be introduced at any stage of the application lifecycle, planning what you want to log before you start writing code optimizes the process. This allows you to integrate logging seamlessly where it’s useful, efficient, and maintainable.
It’s important to include all system components and code modules in your logging. Otherwise, you may find a bug located in an area that’s only partially logged or not logged at all, and then you’ll either need another way to tackle the issue or you’ll have to add logging retroactively and wait until the problem occurs again. Adding new logs on the fly, redeploying the system, and waiting for an elusive bug to reoccur while your users are expecting a fix results in a subpar user experience and does not inspire trust in your company.
Good logs have good structure. Here are four elements that lend structure to logs, and why it’s important to keep each of them in mind while generating log messages:
To navigate easily through large amounts of data, log data should be organized systematically. Using different categories for different subsystems enables you to filter logs for the specific part of the application that is relevant to your current analysis.
Virtually all log systems support different groups of messages, commonly referred to as log levels. While different log systems may use different names or offer slightly different sets of levels, the following are common across systems:
- Debug logs help software developers with the problem at hand by providing technical and specific information, such as the context necessary to reproduce a bug.
- Info logs are usually bound to certain events related to the software’s business value, such as starting and stopping a service or creating and removing a file. Such logs are useful when gathering statistics and analyzing various usage scenarios.
- Warning logs warn about potentially dangerous events in the system or circumstances that might lead to such events. For instance, they will notify you if a disk has almost run out of space.
- Error logs are exactly what their name suggests: the description and metadata of the errors that happen in your system. For example, a mobile application can generate an error log when its backend server is not available.
- Fault or fatal-level logs show critical failures of the software that prevent its proper functioning. They usually mean that an intervention from an engineer is required. For instance, a microservice can log a fault when its connected database is down.
Using log levels consistently allows entries to be filtered, limiting the output to the necessary minimum. Together with categorizing, this would make it possible, for example, to display warnings related solely to the database layer.
Some log systems allow you to add custom tags to log messages. Similarly to how categories help distinguish log messages produced by different subsystems, tags allow you to group your log messages by custom criteria, for example, “A/B Testing” or “Performance.”
Adhering to a well-known formatting structure, such as JSON or XML, makes processing and storing of the log data more efficient. However, there’s a catch: A fixed format imposes rigid constraints on the log message. Thus, applying a strict schema to each log message might result in partial loss of context, which is less likely with a free-form message.
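To illustrate, a minimal JSON formatter can be sketched on top of Python’s standard `logging` module; the `myapp.payments` category and the chosen field names are our own assumptions, not a standard schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (a common convention)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myapp.payments")  # hypothetical category
log.addHandler(handler)
log.propagate = False
log.warning("retrying charge, attempt %d", 2)
```

Keeping a free-form `message` field inside the fixed schema is one way to get machine-parseable structure without losing human-readable context.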
Software developers are sometimes reluctant to write any kind of documentation, including log messages, and generating consistent log messages requires discipline and long-term commitment. However, using established terminology and unified formatting always pays off, because it helps you skim through vast amounts of log messages more easily and reduces the possibility of human error.
Reading logs can be a challenge even when they are properly organized. At the very minimum, using units and date formats consistently is necessary if you don’t want to spend hours reading scattered entries, and editing messages that use different terms for the same concept.
Before adding logging, it’s worth deciding which data will be useful for your use case. Relevant data provides your log messages with context. For example, attaching unique identifiers to user API requests makes it easier to find information about a specific request. Even if a log message is in plain text, consistently including a timestamp will eventually give the message context during log filtering and analysis.
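One way to attach a unique identifier to every message emitted while serving a request is Python’s `logging.LoggerAdapter`; the format string, logger name, and `handle_request` function below are illustrative:

```python
import logging
import uuid

logging.basicConfig(
    format="%(asctime)s %(levelname)s [request=%(request_id)s] %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("myapp.api")

def handle_request() -> None:
    # Every message emitted through this adapter carries the same request ID,
    # so all entries for one request can be found with a single filter.
    request_log = logging.LoggerAdapter(log, {"request_id": uuid.uuid4().hex[:8]})
    request_log.info("request received")
    request_log.info("response sent")

handle_request()
```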
On the other hand, it’s equally beneficial to avoid generating unnecessary log messages. Logging irrelevant data introduces noise, slows down the search for relevant information, and wastes both your and your users’ disk space.
The key thing to remember when it comes to relevance is that detailed and meaningful messages are the core of your log entries. Even if logs are collected, stored, and processed by sophisticated automated systems, they are ultimately read and interpreted by humans, so they need to convey relevant meaning.
When planning which data to incorporate into logs, users’ private information should be excluded or at least encrypted. Some log systems can redact personal details, such as names or credit card numbers. This obfuscates sensitive data but still allows log entries containing a specific encrypted identifier to be collated.
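As a rough sketch of this idea, the filter below masks card-like numbers with a short stable hash before a record is emitted; the regex is deliberately naive and the whole snippet is an illustration, not a production-grade redaction scheme:

```python
import hashlib
import logging
import re

# Naive card-number pattern, for illustration only.
CARD_RE = re.compile(r"\b\d{13,16}\b")

class RedactingFilter(logging.Filter):
    """Replace card numbers with a short stable hash, so entries about the
    same card can still be collated without exposing the number itself."""
    def filter(self, record: logging.LogRecord) -> bool:
        def redact(match: re.Match) -> str:
            digest = hashlib.sha256(match.group().encode()).hexdigest()[:8]
            return f"card:{digest}"
        record.msg = CARD_RE.sub(redact, record.getMessage())
        record.args = None  # the message is already fully formatted
        return True

log = logging.getLogger("myapp.billing")  # hypothetical category
log.addFilter(RedactingFilter())
```

Because the hash is deterministic, two log entries about the same card produce the same `card:` token and can still be grouped during analysis.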
If sending sensitive data to your log server is unavoidable, precautions should be taken. The internet connection must be secure, data should be encrypted, and access to logs should be restricted to a select few individuals whose roles require it.
It’s also advisable to keep your software updated, because new patches often contain fixes to known security breaches. However, this comes with its own pitfalls: New updates sometimes include known issues, so always read release notes and bulletins.
Finally, if your company must comply with certain regulations, such as GDPR for companies operating within the EU, logs require particular attention. Regulations may require that certain data types, including logs, have a finite retention period.
Log analysis involves the set of activities related to reading, searching, and interpreting collected logs. While effective log generation and collection are integral parts of an efficient log system, they make up just half of its success. When it comes to actually using the resulting logs, analysis comes to the fore.
IT professionals generally use log analysis in a precisely targeted way, focusing on specific sections of the entire log to answer questions, analyze aspects of performance, or investigate incidents. For instance, the focus might be on logs related to a user session during the time when a particular bug occurred.
Between the difficulty of processing large volumes of stored data and the network delays involved with remote log servers, log analysis comes with its own challenges. In this section, we’ll take a look at best practices that make the process easier.
After logs are generated and collected, they are stored locally or sent to a log storage system. This is where IT professionals access the generated logs to analyze them.
To facilitate efficient analysis, a good storage system for your logs should possess the following qualities:
- Friendly interface: Browsing log messages starts with accessibility; a cryptic API can easily make reading logs torturous. Choose a log system or service that makes your logs easily accessible, preferably one that has an intuitive user interface.
- Security: As log entries can contain sensitive data or business-critical information, access should be restricted. Implement a storage system that allows individual or role-based access control.
- Easy browsing: The storage should allow easy browsing, sorting, and filtering. Centralized storage allows logs to be easily and quickly collated from different subsystems, giving you a holistic view of your logs—crucial, given the growth of distributed systems and cloud-based services.
- Rotation: To comply with regulations and reduce costs, the storage system should support log rotation. This means that when storage limits are exceeded or retention periods expire, old or irrelevant data is automatically deleted.
- Scalability: Scalability will ensure that your logs will not be lost if your service usage grows rapidly. To limit the cost of storing vast amounts of data, a storage service might compress files containing log messages, especially if the data is old or is being stored only because of a retention policy.
- Indexing: Like other kinds of databases, logs benefit from indexing for faster access and more efficient filtering and sorting.
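As a small example of the rotation point above, Python’s standard library ships a size-based rotating handler; the file path and limits below are arbitrary choices for the sketch:

```python
import logging
import logging.handlers
import os
import tempfile

# Hypothetical path and limits: rotate at ~1 MB, keep at most 5 old copies.
log_path = os.path.join(tempfile.gettempdir(), "myapp.log")
handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1_000_000, backupCount=5
)

log = logging.getLogger("myapp")
log.addHandler(handler)
log.warning("written to myapp.log; the oldest backup is deleted on rollover")
```

Time-based retention (for example, for regulatory deletion deadlines) can be handled similarly with `TimedRotatingFileHandler` from the same module.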
Since log messages may come from disparate sources, such as different modules of an application or different microservices of the server, data received by the log system may be in diverse formats. Many software systems use the legacy syslog format, while others have their own, e.g., Apple’s unified logging system, which is used in macOS and iOS applications. Data should be automatically converted to a consistent format when it is received and stored.
Modern log services support various filters and parsers that will format logs to one standard. However, data should follow the standards which are recognized by your log analysis tools.
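As a rough sketch of such normalization, the function below converts an RFC 3164-style syslog line into a uniform dictionary; the regex is heavily simplified and the output field names are our own, not a standard:

```python
import re

# Simplified RFC 3164-style syslog pattern, for illustration only.
SYSLOG_RE = re.compile(
    r"^(?P<month>\w{3}) +(?P<day>\d+) (?P<time>[\d:]+) (?P<host>\S+) "
    r"(?P<app>[\w\-/]+)(\[\d+\])?: (?P<message>.*)$"
)

def normalize(line: str) -> dict:
    """Convert a raw syslog line into a uniform dict; fall back to raw text."""
    m = SYSLOG_RE.match(line)
    if m:
        return {"source": m.group("app"), "host": m.group("host"),
                "message": m.group("message")}
    return {"source": "unknown", "host": None, "message": line}

print(normalize("Mar  4 10:15:01 web01 nginx[123]: upstream timed out"))
```

Real log pipelines usually delegate this to the filters and parsers built into the log service rather than hand-rolled regexes, but the principle is the same: every source ends up in one schema.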
An effective log analysis tool should allow the collation of log messages from different sources. This is crucial for investigation of incidents that occur and are logged in one subsystem, but are actually caused by a failure in another subsystem. Such scenarios require a deep understanding of the context and juxtaposition of all relevant events from all involved software components.
Regular monitoring of incoming log messages allows for quicker identification of unusual activity, prompt reactions to security incidents, and earlier handling of performance drops. As a result, cyberattacks can be averted and the service operates without interruption. This is especially important for services where even a brief downtime can result in significant loss of revenue, such as financial services or large enterprises.
Nowadays, monitoring often includes machine learning, which adapts to the specifics of your use case and constantly learns to predict events by analyzing more and more data. With the help of machine learning, logging systems can detect patterns in log messages that could, for instance, be a sign of a cyberattack, but aren’t obvious to human interpreters.
In order to receive timely alerts from a managed logging system, log messages should be written into the system as close to real time as possible. While this may seem obvious, it’s not that simple to implement.
Log collection demands computational power, and sending log messages to a server consumes network bandwidth. However, as log collection is not critical to the immediate user experience, it’s not always a priority. As a result, many systems send log messages using buffers and background queuing.
As often happens with IT systems, actual “real-timeness” is also a matter of a tradeoff: If logs are processed in the background with low priority, they may be analyzed and interpreted later than desired. On the other hand, if logging is given a high priority, the software’s responsiveness may suffer.
Managed logging systems, also known as logging as a service (LaaS), are centralized storage systems for log collection and analysis. These services save companies the investment of building and maintaining their own solutions.
Managed logging systems typically offer flexible options for log storage and rotation, and a user-friendly interface for displaying, sorting, and filtering historical data. However, different systems offer different feature sets, so it’s advisable to familiarize yourself with what a service provides before making a final decision; switching between managed logging systems can be both costly and cumbersome.
Gcore Managed Logging stores logs collected from different sources and compiles them into a single, intuitive system that can be browsed using OpenSearch Dashboards. For better reliability, we use Kafka servers as an intermediary buffer and retain received log messages, ensuring they remain available even if the log source is currently down.
Log collection and analysis ultimately lead to a better user experience delivered by your services and applications. The best practices discussed in this article can help make working with the collected logs easier and more efficient.
Managed logging services, such as Gcore Managed Logging, help you get the most from your logs. You get centralized persistent storage that accumulates logs from all your services and displays them on a dashboard, where you can easily collate, filter, and monitor your log messages.