Best Practices for Logging Enterprise Applications (From a Support Engineer's Perspective)



Application logs record the external and internal events that an application observes during execution. When a bug, security breach, or anomaly occurs in a deployment, the logs are the most useful and reliable evidence for analyzing the root cause of the incident.



Let's take a look at how to write log messages that everyone will love.



1. What to log



Incoming and outgoing messages



If components interact with each other via messages, you need to record incoming and outgoing messages, including API endpoint URLs, request parameters, source and intermediate IP addresses of the request, request headers, information about the author, request and response bodies, business context, timestamps, and internal processing steps.



It is very important that each message is assigned a unique identifier (usually generated at the API manager or service level). This is necessary to track the exchange of messages between services in the system.
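Here is a minimal sketch of this idea as an Express-style middleware (the framework choice, the X-Request-Id header name, and the logger setup are assumptions for illustration) that assigns each incoming request a unique identifier and records the inbound message details:

import express from "express";
import log4js from "log4js";
import { randomUUID } from "crypto";

const logger = log4js.getLogger("http");
logger.level = "debug";

const app = express();
app.use(express.json());

// Assign each incoming message a unique ID and record the inbound details.
app.use((req, res, next) => {
  const requestId = req.header("X-Request-Id") ?? randomUUID();
  res.setHeader("X-Request-Id", requestId); // propagate back to the caller
  logger.debug(`[${requestId}] ${req.method} ${req.originalUrl} from ${req.ip}`);
  logger.debug(`[${requestId}] Request Body: ${JSON.stringify(req.body)}`);
  next();
});

Downstream services can reuse the same header, so a single identifier follows the message across the system.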



Calling services and functions



When calling a service or function, it is desirable to log their context in more detail, mainly for debugging (use TRACE or DEBUG). These logs will help you troubleshoot business logic issues, especially if you don't have the privilege to attach a debugger to an application (for example, when deploying to a test, staging, or pre-prod environment).
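As a minimal sketch (the function and its fields are hypothetical), logging the context of a function call at DEBUG and TRACE might look like this:

import log4js from "log4js";

const logger = log4js.getLogger("OrderService");

// Hypothetical business function: log inputs, intermediate state, and outcome.
function applyDiscount(orderId: string, total: number, coupon?: string): number {
  logger.debug(`applyDiscount called (orderId=${orderId}, total=${total}, coupon=${coupon ?? "none"})`);
  const rate = coupon === "VIP10" ? 0.1 : 0;
  logger.trace(`resolved discount rate=${rate} for coupon=${coupon ?? "none"}`);
  const result = total * (1 - rate);
  logger.debug(`applyDiscount finished (orderId=${orderId}, result=${result})`);
  return result;
}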



User actions and business statistics



Each application has unique business scenarios and user journeys that provide a lot of information to the relevant specialists. For example, whether a certain transaction takes too long, or whether users get stuck on some piece of functionality: all of this is very important data from a user-experience point of view. Other business-related information, such as the number of transactions and active users and the stages they pass through, is important for finding cause-and-effect relationships and can even feed into business analysis.



Data Operations (Audit Trail)



For security and compliance reasons, most enterprise applications require separate transaction logs with all the important information, such as access IDs (users and systems), the exact service instances and role privileges used, timestamps, data requests, and snapshots of the previous and new state of the changed data (diff). The audit trail should record all operations with data (access, import, export, etc.), as well as CRUD operations (create, read, update, delete) performed by users, other systems, and services.
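A minimal sketch of such an audit record (the field names and the diff format are assumptions, not a standard):

import log4js from "log4js";

const auditLogger = log4js.getLogger("audit");

// Hypothetical audit entry for an UPDATE performed through a user service.
auditLogger.info(JSON.stringify({
  timestamp: new Date().toISOString(),
  actor: { userId: "U-1042", role: "account-manager" },
  serviceInstance: "USERS_V1_I03",
  operation: "UPDATE",
  entity: "customer/5f5a0d59",
  before: { tier: "silver" }, // snapshot of the previous state
  after: { tier: "gold" },    // new state of the changed data (diff)
}));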



System events



These include information about behavior (starts, stops, restarts, and security-related events), transient modes (cold, warming up, hot), interservice communication (handshakes, connection statuses: connected, disconnected, reconnected, retries), service instance identifiers, actively served APIs, IP and port ranges being listened on, loaded configurations (bootstrap and dynamic updates), overall service health, and anything else that can help you understand the behavior of the system.



Performance statistics



Computers are diligent workers, but they do not always perform perfectly. At any moment there may be performance issues or sudden, unexpected degradations in service (mostly due to unhandled errors and corrupted data). To detect them, it is always recommended to publish statistics on the overall health and performance of the system: API call counters (successful and failed), network latency, average round-trip duration, memory consumption, and other application-specific information (usually determined by the business context).
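As an illustration, here is a minimal sketch of periodically publishing such statistics (the counters and the one-minute interval are assumptions):

import log4js from "log4js";

const statsLogger = log4js.getLogger("stats");

// Hypothetical in-process counters, updated elsewhere in the application.
const stats = { apiCallsOk: 0, apiCallsFailed: 0, avgRoundTripMs: 0 };

// Publish a health snapshot once a minute.
setInterval(() => {
  const mem = process.memoryUsage();
  statsLogger.info(
    `health ok=${stats.apiCallsOk} failed=${stats.apiCallsFailed} ` +
    `avgRoundTripMs=${stats.avgRoundTripMs} heapUsedMB=${(mem.heapUsed / 1048576).toFixed(1)}`
  );
}, 60_000);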



Threats and vulnerabilities



Detecting threats and vulnerabilities from a running application and its logs is an art that every enterprise software developer must master. Hacks and failures usually don't happen suddenly; most often there are warning signs that no one notices at first. Therefore, always log suspicious human activity (for example, failed authentication and verification attempts, together with all the low-level information such as the networks used, the sources of requests, and the user roles and privileges involved), as well as suspicious system behavior (for example, spikes in resource consumption patterns, high load on web servers, occasional service failures). When you notice a suspicious event, make sure the logs contain all the information related to it: ideally a full stack trace, along with parameter values and additional information obtained from the application context.



2. What you don't need to log



Personal information



Nearly all privacy laws (e.g., GDPR, CCPA) explicitly direct developers not to log personally identifiable information. This includes names, nicknames, gender, birthday, street address, email, phone numbers, social security numbers, and credit card numbers.



Company names and contact information



Make sure you don't log company names, employee, customer, or supplier information, or the contact details of companies and individuals. Logs should never disclose business relationships and transactions with third parties. To trace specific transactions, use system-generated event IDs instead of real names and pass those on to other services.



Financial data (bank accounts, bank card details, amounts sent, etc.)



By law, all financial data must be completely removed or masked. Disclosing such information in logs can easily lead to serious legal action (up to criminal liability). Avoid it at all costs.



Passwords, security keys and secrets, authentication tokens



Credentials and authentication tokens are confidential information; their presence in logs helps attackers easily find holes in the system. Therefore, never let this data reach the logs.



Note: You can easily decide which information to hide from the logs if you add an attribute to each field that defines its visibility level (for example: show, mask, hide, encrypt). With such a mechanism in place, you can change the visibility of fields simply by updating the properties in the configuration. This is a good solution when you need to log some user information in non-production environments, especially for testing and debugging. Alternatively, you can write parsers that filter logs and process sensitive fields according to pre-defined, per-environment instructions.
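A minimal sketch of that field-visibility idea (the field list and the visibility values are assumptions):

type Visibility = "show" | "mask" | "hide";

// Per-field visibility, e.g. loaded from per-environment configuration.
const fieldVisibility: Record<string, Visibility> = {
  user_id: "mask",
  gender: "show",
  SSN: "hide",
};

// Apply the configured visibility before a record reaches the logs.
function sanitize(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    const rule = fieldVisibility[key] ?? "mask"; // mask unknown fields by default
    if (rule === "show") out[key] = value;
    else if (rule === "mask") out[key] = "XXXXXXXXXX";
    // "hide": drop the field entirely
  }
  return out;
}

console.log(JSON.stringify(sanitize({ user_id: "u-1", gender: "Male", SSN: "123-45-6789" })));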



3. Best practices



Know when to use a particular level of logging



The log level is used to indicate the severity of each event in the system. Most logging frameworks support the following levels:



  • FATAL: very serious errors that will most likely cause the application to terminate. Usually these indicate serious failures.
  • ERROR: errors with which the application can still continue to work, but with degraded functionality.
  • WARN: less dangerous events than errors. They usually do not degrade or crash the application, but they are still noteworthy events that need to be investigated.
  • INFO: important event banners and informational messages about application behavior.
  • DEBUG: fine-grained diagnostic information for developers, used for debugging and technical analysis.
  • TRACE: the most detailed level, tracing the execution flow step by step. Used only in exceptional diagnostic cases.


Linux Syslog also has its own, more granular set of severity levels: Emergency, Alert, Critical, Error, Warning, Notice, Informational, and Debug.



Regardless of the complexity and granularity of the available log levels, we must configure them correctly in our code to provide the optimal amount of information in each scenario. For example, all data used by developers for debugging and technical analysis should go to the DEBUG or TRACE levels, while system banners and similar messages belong at INFO.
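With Log4js, one of the frameworks recommended at the end of this article, wiring up the levels might look roughly like this (the category names are assumptions):

import log4js from "log4js";

log4js.configure({
  appenders: { out: { type: "stdout" } },
  categories: {
    default: { appenders: ["out"], level: "info" }, // banners, key events
    App: { appenders: ["out"], level: "debug" },    // developer diagnostics
  },
});

const logger = log4js.getLogger("App");
logger.info("XYZ Integration API Manager v2.0.0");
logger.debug("Loading adapters..");
logger.trace("suppressed"); // not printed: TRACE is below the active DEBUG level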



Use English



Some tools and consoles do not support writing and storing logs that contain certain Unicode characters, so localization and similar enhancements can be problematic. Stick to English and always use a widely supported character set for log messages.



2020-11-11 13:52:18 DEBUG  App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG  Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG  Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG  Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG  Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG  Adapters:20 - Successfully loaded all





Add developer-friendly messages (concise and meaningful)



If you log too little, you will not be able to get the information you need to recreate the context of each important event. If you log too much, performance suffers: writing a huge log file increases I/O and disk storage consumption, and if the file system cannot keep up with the write volume, overall system performance degrades.



To optimize messages, you need a clear understanding of the functional and non-functional expectations of the system, and you should plan for the desired quality and quantity of messages. Each log message should be meaningful and appropriate to its context: always write briefly and to the point.



2020-11-11 13:52:27 DEBUG  Users:18 - Successfully created new user (RecordID: 5f5a0d594d17e22b848ee570)
2020-11-11 13:52:27 ERROR  Users:34 - Failed to update DB (E: inactive user, RecordID: 5f5a0d594d17e22b848ee570)





Create reference identifiers, aliases and simplified templates for frequently used and long messages



Instead of writing out a full description every time, try creating reference identifiers or simplified templates to represent long, repetitive descriptions in the log. This reduces the overall number and length of messages and also gives you more flexibility to hide certain information. For example, instead of describing a vulnerability in plain text in the log, it is better to use an alias or identifier so that only the relevant domain experts can understand the scenario.
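A minimal sketch of such an alias catalog (the codes and descriptions here are hypothetical; real systems often reuse vendor codes, as in the Oracle example below):

// Hypothetical catalog of reference identifiers for long, repetitive,
// or sensitive descriptions; the log carries only the short code.
const errorCatalog: Record<string, string> = {
  "E-AUTH-01": "invalid username/password; logon denied",
  "E-POOL-02": "connection pool exhausted; retry with backoff",
};

// Only the alias goes into the log; experts resolve it via the catalog.
function logFailure(code: string): void {
  console.error(`ERROR  DB:22 - Failed to write (E:${code})`);
}

logFailure("E-AUTH-01");
console.log(errorCatalog["E-AUTH-01"]); // -> invalid username/password; logon denied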



2020-11-11 13:52:27 ERROR  DB:22 - Failed to write (E:ORA-01017)

// ORA-01017 denotes the "invalid username/password; logon denied" error message





Use correct timestamps



Timestamps let you understand the sequence of events and are essential for debugging and analysis. When recording time, use the most precise values available (for example, millisecond or microsecond granularity) so that adjacent events are easier to tell apart. Also, make sure timestamps appear at the beginning of the message, in a format such as yyyy-MM-dd HH:mm:ss. Always indicate the time zone unless you are using the server's default time (UTC).
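If you use Log4js, the timestamp format can be set with a pattern layout. Here is a minimal sketch, assuming log4js date-format tokens (worth verifying against your version); its output resembles the samples below:

import log4js from "log4js";

log4js.configure({
  appenders: {
    out: {
      type: "stdout",
      // %d = timestamp, %p = level, %m = message; "O" appends the UTC offset
      layout: { type: "pattern", pattern: "%d{yyyy-MM-dd hh:mm:ss.SSSO} %p %m" },
    },
  },
  categories: { default: { appenders: ["out"], level: "info" } },
});

log4js.getLogger().info("XYZ Integration API Manager v2.0.0");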



// with default server time (UTC)
2020-11-11 13:52:12 INFO  XYZ Integration API Manager v2.0.0

// with timezone (e.g. PST - Pacific Standard Time)
2020-11-11 13:52:12PST INFO  XYZ Integration API Manager v2.0.0





Provide source or origin of log data (for DEBUG, TRACE, ERROR)



This is very useful for pinpointing the location that produced a given message. Some logging frameworks let you specify the source at the most detailed level (down to file names with line numbers), but usually mentioning just the class, function, or file name is sufficient.



2020-11-11 13:52:12 INFO  app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO  app - Loading configurations..
2020-11-11 13:52:18 INFO  app - *** InstanceID APIM_V2_I02
2020-11-11 13:52:18 INFO  app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO  app - *** LogLevel 05 (DEBUG)
2020-11-11 13:52:12 DEBUG  App:22 - Initializing Swagger UI..
2020-11-11 13:52:14 DEBUG  App:28 - Generating schemata..
2020-11-11 13:52:14 DEBUG  App:30 - Initializing REST services..
2020-11-11 13:52:15 DEBUG  App:32 - Generating documentation..
2020-11-11 13:52:18 DEBUG  App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG  Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG  Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG  Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG  Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG  Adapters:20 - Successfully loaded all
2020-11-11 13:52:31 INFO  app - Started listening...





Each log message must be unique within the system



Most newcomers make the same mistake: they copy and paste a sample message into different files, so the final log is assembled from identical lines coming from different parts of the system. In such cases it is hard to track down the specific location that triggered the event. If the wording cannot be changed, at least mention the source in the message so that the lines in the final file differ from each other. Also, if logging is handled by a parent class, pass an identifier during initialization and use it when recording messages about the behavior of child classes.
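A minimal sketch of the parent-class approach (the class and source names are hypothetical); the sample log below shows the kind of output this produces:

import log4js from "log4js";

// The parent class receives a source identifier during initialization
// and prefixes every message with it, so child-class lines stay unique.
abstract class Adapter {
  protected logger = log4js.getLogger("Adapters");
  constructor(private readonly sourceId: string) {}

  protected log(message: string): void {
    this.logger.debug(`[${this.sourceId}] ${message}`);
  }
}

class DocsAdapter extends Adapter {
  constructor() { super("docs/v1"); }
  init(): void { this.log("adapter initialized"); } // -> [docs/v1] adapter initialized
}

class MongoAdapter extends Adapter {
  constructor() { super("mongo/v1"); }
  init(): void { this.log("adapter initialized"); } // -> [mongo/v1] adapter initialized
}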



2020-11-11 13:52:18 DEBUG  App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG  Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG  Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG  Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG  Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG  Adapters:20 - Successfully loaded all





Add a tracked ID or message token to the message



When an event or message hits the system, it is usually assigned a unique identifier. It can be passed to other processing stages to track the movement of an event through the system, which is useful for debugging, concurrent processing, and data-related operations. Ideally, the system should have a unique event identifier within all modules and services.
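A minimal sketch of carrying such an identifier through the processing stages and on to a downstream call (the URL, header name, and field names are assumptions; fetch assumes Node 18+). The log sample below shows the resulting output:

import log4js from "log4js";
import { randomUUID } from "crypto";

const logger = log4js.getLogger("RefData");

// The event keeps the same ID through every processing stage.
async function handleRequest(body: Record<string, unknown>): Promise<void> {
  const requestId = (body.request_id as string) ?? randomUUID();
  logger.debug(`[${requestId}] Request Body: ${JSON.stringify(body)}`);

  logger.debug(`[${requestId}] Sending request to RefData: http://10.244.2.168:8280/v1`);
  await fetch("http://10.244.2.168:8280/v1", {
    method: "POST",
    headers: { "X-Request-Id": requestId }, // downstream services log the same ID
    body: JSON.stringify(body),
  });
}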



[87d4s-a98d7-9a8742jsd] Request Body: {
 "request_id": "87d4s-a98d7-9a8742jsd",
 "app_id": "TX001",
 "option_val": "IBM",
 "req_type_id": "0013",
 "data": {...........}
}
[87d4s-a98d7-9a8742jsd] Sending request to RefData: http://10.244.2.168:8280/v1





Match identifiers at transition points



In certain cases, especially when an event is handed over from one system to another, identifiers change according to the conventions of the receiving system. At such transition points, you must explicitly record the mapping between the old and new identifiers in a separate message, together with the necessary context.



[1000508] ***** Incoming request from 10.244.2.168:3000 *****
[1000508] Origin Id: 87d4s-a98d7-9a8742jsd -> System ID: 1000508
[1000508] Transaction successfully added to Rabbit Queue





Specify identifiers of all service instances



Most enterprise systems are distributed computing platforms in which there are many instances of the same services with various application configurations, resources, versions, and network properties. It is recommended that you assign identifiers to instances and use them for interservice communication.



2020-11-11 13:52:12 INFO  app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO  app - Loading configurations..
2020-11-11 13:52:18 INFO  app - *** InstanceID APIM_V2_I02
2020-11-11 13:52:18 INFO  app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO  app - *** LogLevel 05 (DEBUG)





Configure an active logging level



The active logging level should vary by deployment environment. For production, it is recommended to emit logs up to the INFO level; in other environments, logs go down to the DEBUG or TRACE level, depending on the level of detail required by the development and maintenance teams.
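A minimal sketch of driving the active level from an environment variable (the variable name and default are assumptions); the two samples below show the resulting output at DEBUG and at INFO:

import log4js from "log4js";

// e.g. LOG_LEVEL=debug on staging; unset (so INFO) in production
const activeLevel = process.env.LOG_LEVEL ?? "info";

log4js.configure({
  appenders: { out: { type: "stdout" } },
  categories: { default: { appenders: ["out"], level: activeLevel } },
});

const logger = log4js.getLogger("app");
logger.info(`*** LogLevel ${activeLevel.toUpperCase()}`);
logger.debug("Initializing Swagger UI.."); // emitted only when LOG_LEVEL=debug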



// Active Log Level = DEBUG
2020-11-11 13:52:12 INFO  app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO  app - Loading configurations..
2020-11-11 13:52:18 INFO  app - *** InstanceID APIM_V2_I02
2020-11-11 13:52:18 INFO  app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO  app - *** LogLevel 05 (DEBUG)
2020-11-11 13:52:12 DEBUG  App:22 - Initializing Swagger UI..
2020-11-11 13:52:14 DEBUG  App:28 - Generating schemata..
2020-11-11 13:52:14 DEBUG  App:30 - Initializing REST services..
2020-11-11 13:52:15 DEBUG  App:32 - Generating documentation..
2020-11-11 13:52:18 DEBUG  App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG  Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG  Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG  Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG  Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG  Adapters:20 - Successfully loaded all
2020-11-11 13:52:31 INFO  app - Started listening...

// Active Log Level = INFO
2020-11-11 13:52:12 INFO  app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO  app - Loading configurations..
2020-11-11 13:52:18 INFO  app - *** InstanceID API_V2_I02
2020-11-11 13:52:18 INFO  app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO  app - *** LogLevel 04 (INFO)
2020-11-11 13:52:31 INFO  app - Started listening...





Provide sufficient context for errors and crashes



Errors and failures require the most thorough investigation, so the application must give subject-matter experts the information they need, along with the technical and business context. For example, if a request or message could not be processed, it is very useful to log the body of the failed request in addition to the error.
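A minimal sketch of attaching that context when validation fails (the validator and the message shape are hypothetical); the sample below shows the resulting log:

import log4js from "log4js";

const logger = log4js.getLogger("Validator");

function processMessage(systemId: string, msg: Record<string, unknown>): void {
  try {
    if (msg.option_val === undefined) {
      throw new Error("missing params - option_val");
    }
    // ... normal processing ...
  } catch (err) {
    // Log the error AND the failed payload so experts can reproduce the case.
    logger.error(`[${systemId}] Failed to validate msg body (E: ${(err as Error).message})`);
    logger.error(`[${systemId}] Failed Message: ${JSON.stringify(msg)}`);
  }
}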



[1000508] ***** Incoming request from 10.244.2.168:3000 *****
[1000508] Origin Id: 87d4s-a98d7-9a8742jsd -> System ID: 1000508
[1000508] Failed to validate msg body (E: Uncaught ReferenceError: missing params - option_val)
[1000508] Failed Message: {
 "request_id": "87d4s-a98d7-9a8742jsd",
 "app_id": "TX001",
 "req_type_id": "0013",
 "data": {...........}
}





Back up data transactions with evidence (don't assume!)



Never take it for granted that a data operation succeeded; always verify the end condition with evidence. For example, when you create, update, or delete a record in a database, the database returns the count of affected records along with the updated record itself. Always check that the counter or result matches your expectation. Another example: when you insert a record into a data structure (for example, a queue), check whether its size has increased. Suppose your system uses concurrent operations but the queue is not thread-safe: you can silently lose records, and the only way to detect such hidden bugs is to check the length.
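A minimal sketch of such a check, assuming a hypothetical database client whose bulk insert reports the number of inserted rows (as most drivers do); the sample below shows the happy path:

import log4js from "log4js";

const logger = log4js.getLogger("BatchWriter");

// `db` is a hypothetical client; adapt to your driver's result shape.
async function writeAccounts(
  db: { insertMany(rows: object[]): Promise<{ insertedCount: number }> },
  accounts: object[]
): Promise<void> {
  logger.debug(`Successfully connected to DB. Trying to insert ${accounts.length} accounts...`);
  const result = await db.insertMany(accounts);

  // Don't assume success: compare the evidence against the expectation.
  if (result.insertedCount !== accounts.length) {
    logger.error(`Insert mismatch: expected ${accounts.length}, DB reported ${result.insertedCount}`);
  } else {
    logger.debug(`Successfully inserted ${result.insertedCount} accounts. Total DB rows affected: ${result.insertedCount}`);
  }
}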



DEBUG  BatchWriter:28 - Successfully connected to DB. Trying to insert 24 accounts...
DEBUG  BatchWriter:35 - Successfully inserted 24 accounts. Total DB rows affected: 24





Encrypt or mask sensitive data



The law usually requires that confidential information belonging to users and internal systems be masked and/or encrypted. Regulatory requirements vary by industry and jurisdiction, so learn the applicable rules and implement the correct procedures in your application. In some cases, you may need to present your logging strategy to the security team and obtain their approval or certification before going live.



[1000508] ***** Incoming request from 10.244.2.168:3000 *****
[1000508] Origin Id: 87d4s-a98d7-9a8742jsd -> System ID: 1000508
[1000508] Request Body: {
"user_id":"XXXXXXXXXX",
"personal_details":{
  "firstName":"XXXXXXXXXX",
  "lastName":"XXXXXXXXXX",
  "DOB":"XXXXXXXXXX",
  "gender":"Male",
  "profession":"Software Architect",
  "industry":"IT",
  "SSN":"XXXXXXXXXX"
},
"address_history":[
  {"streetAddress":"Street No 1","zipcode":"XXXXX","state":"CA"},
  {"streetAddress":"Street No 2","zipcode":"XXXXX","state":"NY"},
  {"streetAddress":"Street No 2","zipcode":"XXXXX","state":"AL"}
],
"card_info":[
  {"type":"amex","card_number":"XXXXXXXXX","credit_limit":"XXXXX"},
  {"type":"visa","card_number":"XXXXXXXXX","credit_limit":"XXXXX"}
]
}





Configure log file naming, rotation policies, storage sizes, and backup procedures



If you write logs to files, make sure they live on a separate disk that does not affect the running application in any way (for example, allocate a volume on a remote server and attach it to the application server). Also, understand your logging frequency and how quickly the files grow. Make sure you have a log rotation policy with sensible naming conventions (for example, appending a creation timestamp to the file name) to keep file sizes and counts manageable. You should also have mechanisms for backing up old logs to a safe location and for regularly cleaning up storage. Depending on your industry, you can set up timed backups (usually every few months or years) and delete all previous files at the end of each period.
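With Log4js, for example, rotation can be configured roughly like this (the file name, size cap, and backup count are assumptions); the listing below shows what a rotated log directory might look like:

import log4js from "log4js";

log4js.configure({
  appenders: {
    file: {
      type: "file",
      filename: "logs/APIM_V2_I02.log", // instance ID in the file name
      maxLogSize: 10 * 1024 * 1024,     // rotate after ~10 MB
      backups: 5,                       // keep 5 rotated files
      compress: true,                   // gzip rotated files
    },
  },
  categories: { default: { appenders: ["file"], level: "info" } },
});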



[ec2-user@ip-XXX-XX-X-XXX logs]$ ls
..
APIM_V2_I02-2020-11-20_04:38:43.log
APIM_V2_I02-2020-11-23_02:05:35.log
APIM_V2_I02-2020-11-24_04:38:17.log
APIM_V2_I02-2020-11-27_03:28:37.log
APIM_V2_I02-2020-11-27_12:06:45.log
...





4. Centralized logging and monitoring





A very common practice among enterprise application developers is to set up a centrally accessible server or location for collecting logs. Typically, such aggregators track the logs not only of applications, but also of devices and operating systems (for example, Linux Syslog), networks, firewalls, databases, and so on. They also decouple log files from application servers and allow logs to be stored in more secure, orderly, and efficient formats for longer periods. In some industries (such as banking and finance), logs must be kept both in local and central storage so that attackers cannot reach both places at once and erase the evidence of their activity. This redundancy, and any mismatch between the two stores, can help spot intrusions.



Write parsers and monitor logs proactively



Most log monitoring tools have built-in support for writing parsers and filters: so-called security information and event management (SIEM) integrations. Parsers help store logs in better-organized formats, and querying the data becomes much easier and faster. Properly organized data can also be fed into monitoring and anomaly-detection systems for proactive monitoring and prediction of future events. These tools are very powerful at visualizing data, both as time series and in real time.



Set up alerts and push notifications for critical incidents



Almost all log monitoring tools let you set thresholds for specific levels. When the system reaches those values, the tool detects the condition early from the log data and notifies sysadmins via alerts, push-notification APIs (such as the Slack Audit Logs API), email, and so on. They can also be pre-configured to trigger automated processes such as dynamic scaling, system backups, load shifting, and so on. However, if you are investing in commercial log monitoring software, evaluate it carefully: for small and medium-sized software systems, such tools can be overkill.



5. Recommended tools



Logging frameworks



JavaScript / TypeScript: Log4js / pino

Java: Log4j

Golang: Logrus

Serverless functions: aws-lambda-powertools-python / self-written



Exploring, Aggregating and Monitoring Logs



CLI tools: less, vim

Cloud tools: Fluentd, AWS CloudWatch

SIEM tools: SolarWinds, Splunk, McAfee ESM, DataDog, IBM QRadar

Others: ELK Stack (ElasticSearch, Logstash, Kibana, Beats), Loggly


