General considerations

Monitoring checks every component and resource to detect incidents before they have an impact on the service. A monitoring check is like a unit test for software: it checks a single element against different alert thresholds so that an alarm is raised, for example, when a disk is 80% full rather than waiting until it is completely full and the service crashes.
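The disk example above can be sketched as a small Nagios-style check; the thresholds and the three-level status scale are illustrative, not a standard:

```python
import shutil

# Illustrative thresholds; tune them per file system.
WARNING_PCT = 80
CRITICAL_PCT = 90

def check_disk(path="/"):
    """Check a single element (disk usage) and return (status, message)."""
    usage = shutil.disk_usage(path)
    pct = usage.used * 100 // usage.total
    if pct >= CRITICAL_PCT:
        return "CRITICAL", f"{path} is {pct}% full"
    if pct >= WARNING_PCT:
        return "WARNING", f"{path} is {pct}% full"
    return "OK", f"{path} is {pct}% full"
```

The check tests exactly one element, so when the alarm fires there is no ambiguity about what went wrong.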

A monitoring check should be easy to reproduce, and the process to solve the problem it reports should be described with simple inference rules. If you cannot write a simple inference rule, perhaps your check is too complex, or just useless.
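One minimal way to hold yourself to that rule is a table mapping each alarm to its resolution procedure; the alarm names and procedures below are hypothetical:

```python
# Each alarm maps to one short, unambiguous resolution procedure.
# If a check cannot be described this way, it is probably too complex.
RESOLUTION_RULES = {
    "disk_full": "Purge old log archives; extend the volume if still above 80%.",
    "db_unreachable": "Check the MySQL service status, then network and firewall rules.",
}

def resolution_for(alarm):
    """Return the documented procedure for an alarm, or flag a gap."""
    return RESOLUTION_RULES.get(alarm, "No rule defined: review this check.")
```

A check with no entry in the table is a signal that the check itself needs rework.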

But we always have some monitoring checks that are defined only as guardrails, just to detect strange behavior such as CPU or memory usage. When those alarms reach their warning level, we do not really know what the problem is (too much load, an attack, a bug), but we know that we have one.

A monitoring check is something alive. Every time an issue occurs that was not detected by your monitoring checks, you have to decide whether to enhance your checks to prevent future issues.

DevOps path

Rather than writing a document describing how to solve an incident when an alarm occurs, it is better to have a tool that analyzes the alarms and displays the actions to execute to solve them. Such a tool also allows defining an alarm hierarchy, which helps eliminate noise alarms and keep the focus on the real ones. For this part you need to decide to invest some time and money to enhance the quality of your system.
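The alarm-hierarchy idea can be sketched with a dependency map: an alarm is noise when one of its parents is already firing (every service alarm under a dead host, for instance). The alarm names and dependencies are illustrative:

```python
# Hypothetical dependency map: child alarm -> alarms that explain it.
DEPENDS_ON = {
    "php_app_down": ["host_down"],
    "mysql_slow": ["host_down", "disk_full"],
}

def root_alarms(active):
    """Keep only alarms whose parents are not themselves firing,
    so operators focus on the real problem, not its symptoms."""
    active = set(active)
    return {a for a in active
            if not any(p in active for p in DEPENDS_ON.get(a, []))}
```

With `host_down` and `php_app_down` both firing, only `host_down` survives the filter: the application alarm is a consequence, not a cause.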

But there are other ways to enhance your monitoring system without investing money or time: you just have to think about monitoring from the beginning. Each of your components has to implement monitoring checks verifying that it can access each of its configured resources. When monitoring is not designed in, you have to implement external checks, for example using the mysql Nagios plugin to verify that your PHP server is allowed to connect to your MySQL database. But that way, if you change the PHP configuration you also have to update the Nagios one, and you can miss alarms when a software failure occurs.

So the good way is to implement a monitoring API for each resource configured on your component, so that the connection check runs directly inside your software rather than through an external plugin. And as it is your check, think about the way to analyze its failures.

And as other components will need to check that your component is alive, think about implementing a way to check it. But keep in mind that checks have to stay elementary: you already have elementary checks for each resource, so there is no need to recheck them before saying you are alive.
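A minimal sketch of such a component health API, assuming a registry of per-resource checks (the resource names and check bodies are placeholders):

```python
def check_mysql():
    # In a real component this would reuse the component's own
    # configured connection, not an external plugin's.
    return True

# One elementary check per configured resource.
RESOURCE_CHECKS = {"mysql": check_mysql}

def status():
    """Run the elementary check of each configured resource."""
    return {name: check() for name, check in RESOURCE_CHECKS.items()}

def alive():
    """Liveness stays elementary: the process answers, nothing more.
    Resources are NOT rechecked here; callers use status() for that."""
    return True
```

Keeping `alive()` trivial means other components get a cheap, unambiguous answer, while `status()` carries the detailed per-resource picture.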

But do not forget that for each alarm you set, you have to provide a document describing what to do to solve the problem. Having a tool to define checks and to start filling a knowledge base is a good idea.

Post-install checks

General considerations

Post-install checks are a kind of self-diagnostic implemented in each component. They use the component's configuration and software functions to check that each resource it is configured with and needs is usable.

Even if you use a continuous delivery process, you have to provide a way to quickly check that all your components are well configured and able to communicate with their resources. Running these post-install checks is quicker and safer than executing functional non-regression tests after the installation of a new release.

DevOps path

The checks a component implements to test that everything works can also be used as configuration tests. Having a global API function that first checks that the component is alive, then that it has access to all of its resources, is a good way to provide them. If you use a tool like Ansible for deployment, it can be useful to call those monitoring checks to see whether the component is well configured, and then run the Ansible playbook to configure it only if needed.

But that also means that your component has to be able to start and answer the status API request even when its resources are not well configured.
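A sketch of that deployment gate, assuming the component exposes the status API as JSON over HTTP; the URL and playbook name are hypothetical:

```python
import json
import subprocess
import urllib.request

STATUS_URL = "http://localhost:8080/status"   # hypothetical status endpoint
PLAYBOOK = "configure-component.yml"          # hypothetical Ansible playbook

def needs_reconfigure(report):
    """Well configured only if every resource check passes."""
    return not report or not all(report.values())

def post_install_check():
    """Query the status API; run the playbook only if something fails."""
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
            report = json.load(resp)
    except OSError:
        # The component does not even answer: investigate before configuring.
        return False
    if needs_reconfigure(report):
        subprocess.run(["ansible-playbook", PLAYBOOK], check=True)
    return True
```

Note that this only works because the component answers the status request even when its resources are misconfigured, as the paragraph above requires.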

Organized Logs

General considerations

Logging information is not just putting a stack trace in a file. Logs are the diary of your system, keeping a trace of the history of its life. They help you understand problems that have occurred, alert you to abnormal use of the system, and also track the activities of a user.

When developers check the logs to debug their software, they are often the only user on the service, so it is easy to know which log entry is generated by which action. But once your service is in production, you can have many users executing different actions simultaneously. So, in order to trace a user action, every log entry should provide enough information to identify its context. You also need a way to analyze your logs to find the answer you are looking for, even when there are a lot of them.
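One common way to carry that context is to emit structured entries tagged with a request identifier; the field names here are illustrative, not a standard:

```python
import json
import logging
import uuid

logger = logging.getLogger("myservice")  # hypothetical service name

def log_event(action, user, request_id=None, **fields):
    """Emit one JSON log entry with enough context to isolate a single
    user action among many concurrent ones."""
    entry = {
        "request_id": request_id or str(uuid.uuid4()),
        "user": user,
        "action": action,
        **fields,
    }
    logger.info(json.dumps(entry))
    return entry
```

Every entry produced while serving one request shares the same `request_id`, so a search on that one value reconstructs the whole story of the action.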

Not all logs are stored for the same purpose: some track usage for statistics or legal reasons, some notify a wrong usage by a user, and some signal a failure of the system.

Debug logs that are only useful to a developer stepping through a single request are useless in production and have to be removed. Remove all noisy logs.

DevOps path

To manage your logs simply, flat files are not a good idea. With technologies like Elasticsearch, we now have solutions for building a simple logging component that manages logs in a database with powerful search functions. So it is not a bad idea to think about a logging API that centralizes all your logs.

But storing the logs and providing a way to dig into them is not enough to have good quality of support. You have to normalize them so they carry the right context information to know what to do with them.

For example, you have to define the log-level scale according to the kind of process that will handle each log. Some logs notify a component failure, for example when a component fails to send a request to another. This kind of log has to be clearly identified so it generates a monitoring alert that allows operators to solve the problem. Other failures can have system-wide impacts that need to be solved by the support team, such as a piece of data not set consistently across all components.
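That routing decision can itself be made explicit in a small table; the categories and destinations below are hypothetical examples of such a scale:

```python
# Illustrative mapping from log category to the process that handles it.
ROUTING = {
    "component_failure": "ops_alert",      # raises a monitoring alert for operators
    "data_inconsistency": "support_task",  # needs a human fix on the data
    "usage": "statistics",                 # kept for stats or legal purposes
}

def route(category):
    """Decide who handles a log entry; unknown categories get reviewed."""
    return ROUTING.get(category, "review")
```

Normalizing entries with such a category field is what turns a pile of text into logs that each trigger the right process.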

If you have multiple components, it is always a good idea to start by documenting the workflow of each user request, explaining what kind of logs can be generated at each step and the impact on the data of your services. A better way is to provide that information in a form you will be able to use to generate diagnostic procedures.

Functional Diagnostics

General considerations

There are some common tests the support team has to execute when a user complains that an issue occurred. Usually it is checking that all the user's data is well filled in. One common way to provide those functions to the support team is to give them admin rights to access the databases. When they are lucky, a support portal is built that allows them to access and check the data. And when they are really lucky, the tool provides some automatic tests to detect common problems, which they can run to diagnose a user complaint.

DevOps path

Most functional issues originate from service data that is no longer consistent. If your data storage allows inconsistent data, the question is not "what will happen if the data becomes inconsistent?" but "what will happen when the data becomes inconsistent?". So a good way to prevent issues is to use storage constraints, where possible, to guarantee data consistency, such as foreign keys in an RDBMS. But if your storage system does not provide that kind of constraint, or if you store data in several different storage systems, a good idea is to write a document describing the consistency rules. Based on those rules, you can write a tool that automatically checks the data to detect and solve problems before users complain.
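A sketch of one such consistency rule across two stores, assuming user ids live in an RDBMS and profiles in a separate document store (the data sets are illustrative):

```python
def orphan_profiles(user_ids, profile_user_ids):
    """Profiles whose owner no longer exists: exactly the kind of
    inconsistency a foreign key would have prevented, checked here
    because the two stores cannot share a constraint."""
    return set(profile_user_ids) - set(user_ids)
```

Run periodically, a battery of rules like this detects broken data before the corresponding support tickets arrive.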

A POI on the DevOps path

If you take care of the support needs from the beginning of your project, you will see that a lot of production problems simply do not occur. It is just like test-driven development: it will change the way you architect your software and improve its quality by following some easy rules:

Develop a monitoring API to check each resource.
Organize the logs.
Think about how to detect data inconsistencies.