Reliability engineers always assume that errors and bugs are inevitable; trouble is expected. The real question is what failures cost the business. Seemingly similar uptime figures hide a big financial difference: 99% application availability allows about 3.65 days of downtime per year, while 99.99% allows only about 53 minutes.
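The arithmetic behind such figures is easy to check. Below is a minimal sketch, assuming a 365-day year; the function name allowed_downtime_minutes is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored (assumption)


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime, in minutes, that an availability target allows over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


# Compare a few common availability targets over one year.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 1440:.2f} days)")
```

Passing a different period_minutes (for example, 90 * 24 * 60 for a quarter) gives the allowance for shorter reporting windows.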
This post is about uptime management: solutions aimed at achieving the highest possible availability of your application. Uptime monitoring can be considered a subdivision of broadly defined application performance management. How critical is it to arrange effective service uptime monitoring? Which practices belong to uptime management? And who is in charge of keeping your software product always on?
These and other relevant questions are addressed in this post, which is dedicated to one of the hottest topics of the DevOps paradigm in general and of SRE in particular.
Evolution Never Stops: SREs Succeed DevOps
DevOps appeared to keep the Agile culture evolving when cloud computing and microservice architectures became common practice for software developers. Closing the gap between development and operations is its prime objective. However, DevOps only describes the general behavior expected of everyone involved in the process, without prescribing explicitly how to achieve it.
SRE (Site Reliability Engineering, sometimes called System Reliability Engineering) makes DevOps concepts concrete with particular methods and practical recommendations, including uptime management techniques. SRE focuses on delivering updates and bug fixes both promptly and seamlessly. That’s why SRE brings developers, admins, and business owners together to agree on an acceptable error threshold before deployments. This leads to the smooth operation of all services amid the continuously growing needs of users.
The so-called CALMS doctrine of DevOps is what SRE converts into practice:
- Culture: encouraging teams to move forward, because failures are becoming less costly;
- Automation: automating manual operations while focusing on what brings long-term benefits to the system;
- Lean: maintaining a proper balance between uptime and downtime in relation to new releases;
- Measurement: making precise measurements of the app availability, downtimes, active loads, etc;
- Sharing: co-ownership of both code and infrastructure due to using the same tools and techniques.
Higher Uptime: How SREs Achieve This
Big Data projects and various SaaS (PaaS, IaaS) providers usually hire system reliability engineers to administer and configure servers. But SRE duties go far beyond server administration and can therefore be applied to keeping any sort of app development sustainable. The availability of services, along with the accessibility of IT infrastructure, makes up the uptime management in which SREs directly participate. What do they usually do in this context?
- Creating documentation and keeping it up to date. Since each moment of downtime results in financial loss, the fastest possible reaction is crucial. That’s why SREs create a so-called runbook that specifies the sequence of actions for addressing urgent problems quickly, including a list of systems to check in each particular case;
- Optimization of the entire technology stack, from program code up to data-center architecture. Typically, SREs are either experienced developers or admins with a strong programming background. That’s why they know very well what can fail, and when, in both development and administration. They review code to block any deployment that needlessly increases the complexity of a system. At the very least, they can veto potentially dangerous updates.
- Selection and implementation of new technologies. SREs deal with particular products and services in the context of the whole IT landscape. That’s why they need to select new technologies with regard to the company’s strategic development.
- Monitoring of the product’s availability with specific metrics and indicators. This is exactly what can be called uptime management. And it’s worth diving deeper into the details.
Uptime Metrics & Indicators
One of the major disagreements between developers and admins lies in their different attitudes toward system (application, service, website) reliability. Reliability is everything for admins, while developers tend to take it more lightly. The SRE approach combines the interests of both groups. For that purpose, a common definition of reliability (availability, uptime) has to be agreed upon.
Agreeing on a Service-Level Objective (SLO) sets the uptime target. Google recommends accepting the lowest affordable availability threshold: the more reliable a system becomes, the more expensive it gets. Hence, determine the lowest uptime you can afford and state it in the SLO. “Affordable” in this context means a level of downtime your users can easily ignore.
To make everything clear, the agreement should contain concrete figures. This is where the Service Level Indicator (SLI) comes in: it can be bandwidth, response time, number of errors, or any other metric relevant to a particular product.
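As an illustration of how an SLI relates to an SLO, here is a minimal sketch. It assumes the SLI is the share of successful requests in a measurement window; the RequestWindow type and the function names are hypothetical, and a real product may choose a different indicator.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts collected over one measurement window (hypothetical structure)."""
    total: int
    failed: int


def availability_sli(window: RequestWindow) -> float:
    """SLI as the fraction of successful requests; an idle window counts as fully available."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def meets_slo(window: RequestWindow, slo: float) -> bool:
    """Compare the measured SLI with the agreed objective, e.g. slo=0.9999 for 99.99%."""
    return availability_sli(window) >= slo


good_week = RequestWindow(total=1_000_000, failed=50)
bad_week = RequestWindow(total=1_000_000, failed=500)
print(meets_slo(good_week, slo=0.9999), meets_slo(bad_week, slo=0.9999))
```

The same comparison works for other indicators (for example, the share of requests answered within a latency threshold) as long as the SLI is expressed as a ratio.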
When formulating uptime objectives, the reliability of every component in the ecosystem should be taken into account. A user whose smartphone is 99% reliable will notice no difference between 99.99% and 99.999% availability of your application. In other words, the vast majority of the failures this user experiences come from the device and its operating system, not from your app. That’s why one more app failure per year will go unnoticed.
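A rough way to see this effect is to multiply the availabilities of the components a request depends on. The sketch below assumes the device and the app fail independently and that the user needs both to work; the function names are illustrative only.

```python
from math import prod


def end_to_end_availability(*components: float) -> float:
    """Availability of a chain in which every component must work (independence assumed)."""
    return prod(components)


def device_failure_share(device: float, app: float) -> float:
    """Approximate share of user-visible downtime caused by the device rather than the app."""
    return (1 - device) / ((1 - device) + (1 - app))


device = 0.99  # smartphone reliability taken from the example above
for app in (0.9999, 0.99999):
    total = end_to_end_availability(device, app)
    share = device_failure_share(device, app)
    print(f"app {app:.5f}: end-to-end {total:.6f}, device share of downtime {share:.1%}")
```

Under these assumptions the extra "nine" in the app changes the end-to-end figure by less than 0.01 percentage points, which supports picking the SLO with the whole chain in mind.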
Application uptime management relies on two widely accepted metrics: MTBF and MTTR.
Mean Time Between Failures (MTBF) is the average period between two consecutive failures of an app. Code quality is the prime factor of MTBF. SREs can influence it through their ability to say “no” to new deployments and updates that do not meet the reliability requirements.
Mean Time To Recovery (MTTR) is the average time needed to restore the app’s availability after a failure. If the SLO sets the app’s uptime at 99.99% per quarter, the team has only about 13 minutes of downtime to spend over 3 months. That means a single incident with a 13-minute MTTR can consume the entire downtime allowance of the SLO. Such a period is very short for humans, while scripts can cope with the same problem in seconds. Automation is therefore crucial for MTTR.
Observability is directly linked to uptime monitoring. It shows how quickly you can determine what is going wrong, along with the state of the system at that moment. From the code perspective, observability means understanding in which service an error occurs, along with the state of the internal variables there. From the infrastructure perspective, it shows the failed areas (a crashed pod if you use Kubernetes, for example).
Observability affects uptime through MTTR: the more observable your service is, the easier it is to find the failed components. Higher observability therefore implies a shorter MTTR and simpler uptime recovery.
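Both metrics can be derived from an incident log. The following is a minimal sketch, assuming each incident is recorded as a (start, end) pair and that MTBF is computed as total uptime divided by the number of failures, which is one common convention; the incident data is made up for illustration.

```python
from datetime import datetime, timedelta

Incident = tuple[datetime, datetime]  # (outage start, service restored)


def total_downtime(incidents: list[Incident]) -> timedelta:
    """Sum the duration of every outage."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Recovery: average outage duration (needs at least one incident)."""
    if not incidents:
        raise ValueError("MTTR is undefined without incidents")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], window: timedelta) -> timedelta:
    """Mean Time Between Failures: uptime in the window divided by the number of failures."""
    if not incidents:
        raise ValueError("MTBF is undefined without incidents")
    return (window - total_downtime(incidents)) / len(incidents)


def availability(mtbf_value: timedelta, mttr_value: timedelta) -> float:
    """Steady-state availability estimated as MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)


# Hypothetical quarter: two outages of 5 and 8 minutes within 90 days.
start = datetime(2024, 1, 1)
incidents = [
    (start + timedelta(days=10), start + timedelta(days=10, minutes=5)),
    (start + timedelta(days=60), start + timedelta(days=60, minutes=8)),
]
window = timedelta(days=90)
repair, between = mttr(incidents), mtbf(incidents, window)
print(f"MTTR {repair}, MTBF {between}, availability {availability(between, repair):.5%}")
```

With 13 minutes of total downtime in the quarter, the estimate lands just below 99.99%, which matches the roughly 13-minute allowance discussed above.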
Uptime Improvement Experiments
Aiming for 100% application uptime is unlikely to be a good idea: it is expensive, technically challenging, and often meaningless, because end users will most probably not appreciate your effort due to the problems they face with “neighboring” systems. Teams therefore have to take risks to some degree. This is where the so-called error budget appears: the amount of downtime the SLO tolerates, which helps developers come to terms with SREs.
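An error budget can be tracked with a few lines of code. The sketch below assumes the budget is expressed as downtime over a 90-day quarter; the function names and the rule "allow a risky change only if its worst-case downtime fits the remaining budget" are illustrative assumptions, not a prescribed policy.

```python
from datetime import timedelta


def error_budget(slo: float, period: timedelta) -> timedelta:
    """Downtime the SLO tolerates over the period, e.g. slo=0.9999 for 99.99%."""
    return period * (1 - slo)


def remaining_budget(slo: float, period: timedelta,
                     outages: list[timedelta]) -> timedelta:
    """Budget left after subtracting the outages already recorded in the period."""
    return error_budget(slo, period) - sum(outages, timedelta())


def can_run_experiment(slo: float, period: timedelta, outages: list[timedelta],
                       expected_cost: timedelta) -> bool:
    """Allow a risky change only if its worst-case downtime fits the remaining budget."""
    return remaining_budget(slo, period, outages) >= expected_cost


quarter = timedelta(days=90)          # assumed reporting period
outages = [timedelta(minutes=5)]      # hypothetical downtime already spent
print("budget:", error_budget(0.9999, quarter))
print("left:  ", remaining_budget(0.9999, quarter, outages))
print("3-minute experiment allowed:",
      can_run_experiment(0.9999, quarter, outages, timedelta(minutes=3)))
```

A check like this could gate releases in a pipeline, so the decision to ship or to freeze follows from the numbers rather than from negotiation.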
If your error budget has not been depleted yet, there is room for experiments. Uptime monitoring can run in parallel with various experimental forms of application performance management as well:
- releasing new performance-impacting features;
- performing system maintenance;
- scheduling pre-planned downtime;
- testing apps right in production, etc.
Netflix calls this method Chaos Engineering and offers some specific utilities to practice it. Chaos Gorilla, for example, simulates the outage of an entire AWS availability zone. It sounds weird, but a failed server is normal in the context of uptime management and brings no harm to your business (as long as your error budget is not empty, of course). What does Chaos Engineering help improve?
- Detecting hidden dependencies when it is unclear what impacts what (especially relevant for microservices);
- Finding code errors that staging cannot detect. No staging environment is a precise simulation: load patterns, hardware, and so on are different;
- Revealing infrastructure shortcomings that neither staging nor CI/CD pipelines can expose.
All of this fits the “blameless postmortem” culture: instead of blaming anybody for failures, the team analyzes their underlying causes in order to improve processes.
Understanding whether the development team meets the SLO is impossible without monitoring. That’s why SREs have to set up uptime monitoring to receive notifications exactly when action is required. There are three urgency levels for various events:
- Alerts: something demands immediate action (fix it right now!);
- Tickets: action is required, but it can wait (gonna have to do something, but not necessarily over the next few minutes);
- Logs: no action is needed, and in the best-case scenario nobody checks the logs at all (one of your microservices failed last week, and logs could reveal what happened).
Uptime monitoring is aimed at detecting which events require action. SREs describe what actions have to be taken. Ideally, everything ends up automated, since automation begins with reactions to events.
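One possible way to map events to these three levels is to look at how fast the error budget is being consumed. The sketch below is only an assumption about how such routing might work: the burn-rate idea (observed error ratio divided by the error ratio the SLO allows) and both thresholds are hypothetical and would need tuning for a real service.

```python
from enum import Enum


class Urgency(Enum):
    ALERT = "alert"    # immediate action
    TICKET = "ticket"  # action required, but not right now
    LOG = "log"        # recorded only


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def classify(error_ratio: float, slo: float,
             page_at: float = 10.0, ticket_at: float = 1.0) -> Urgency:
    """Route an observation to an urgency level; the thresholds are illustrative defaults."""
    rate = burn_rate(error_ratio, slo)
    if rate >= page_at:
        return Urgency.ALERT
    if rate >= ticket_at:
        return Urgency.TICKET
    return Urgency.LOG


# Hypothetical observations for a service with a 99.9% SLO.
for observed in (0.02, 0.005, 0.0001):
    print(f"error ratio {observed}: {classify(observed, slo=0.999).value}")
```

Once events are classified this way, the reaction to each level can be scripted, which is the step from monitoring toward automation.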
Uptime monitoring can be carried out with various monitoring software. The global market offers different systems, both SaaS and on-premise. They can be free or paid. They can be highly specialized products for a particular domain or universal all-in-one solutions. It makes little sense to describe specific monitoring software in this post, since plenty of relevant reviews are available on the internet. Besides, each application is unique in terms of all the above-mentioned factors (SLO, error budget, hardware ecosystem, etc.) and therefore needs an individual uptime management policy.
Application Uptime Management: Conclusion
Uptime management is not a whim reserved for rich projects where expensive SREs orchestrate applications with exotic software. Application uptime management works for teams of any size if they need to release updates, make infrastructure changes, and scale up their business.
Small companies and startups have no need to hire standalone system reliability engineers to arrange uptime management for their applications. The SRE position can be transitory. Besides, growing your own in-house SRE makes sense in many cases. What really matters is following SRE principles in uptime management and monitoring.
You can start small: determine your SLO, SLI, and SLA, and adjust uptime monitoring accordingly. Discussing the SLO, for example, sometimes leads to unexpected revelations: it turns out that a company spends excessive time and effort on maintaining particular processes. It does no harm to realize that making mistakes is natural.
Establish your error budget and spend it purposefully. Analyze all failures and downtimes to obtain results that allow you to implement some form of automation. But first of all, it is worth finding a professional consultant capable of identifying the particular uptime management techniques that fit your project. Contact us today if you are looking for a custom uptime monitoring service for your application.
Dmitry has 5 years of professional IT experience developing numerous consumer & enterprise applications. Dmitry has also implemented infrastructure and process improvement projects for businesses of various sizes. Due to his broad experience, Dmitry quickly understands business needs and improves processes by using established DevOps tools supported by Agile practices.