Customer experience has become the mantra, almost a mission statement, for any serious business in the cloud era. It is at the edge, where websites are expected to be always available, that customer experience, usability, and accessibility are determined. After all, reliability is what matters most to users; an app that nobody can use is worthless to users and businesses alike.
Every company these days is a technology company driving transformational change, and cloud capabilities are central to its entire business strategy. This presents a significant operational challenge: any degradation in performance or customer experience results in lost revenue and lost opportunities, and the traditional operations modus operandi cannot cope with the demands of observability, including real-time monitoring and alerting.
What is the purpose of Site Reliability Engineering (SRE)?
DevOps was born out of the agile movement, which emphasises cross-functional collaboration. DevOps is about focusing on your company’s unique problems and challenges, and about getting things done quickly, efficiently, and with high quality. It is, in essence, a culture, a movement, a philosophy of values, principles, methods, and practises aimed at achieving the organization’s desired outcomes. This speed, however, caused instability: developers were moving faster than ever before, posing a challenge to operations teams. IT operations teams were unprepared for such speed, resulting in significant bottlenecks and backlogs. Unable to keep up with the pace, they faced uncontrollable instability in production and unreliable systems. In response, Google created the SRE role: “a group of developers who can apply engineering expertise to operations problems.”
SRE is a prescriptive implementation of DevOps. It is a way of approaching system administration that focuses on the tenets of running a service: empowering developers and operations, shortening the delivery and incident management lifecycles, and reducing toil. An SRE team’s daily responsibilities include:
- Availability
- Latency
- Performance
- Efficiency
- Change management
- Monitoring and alerting
- Emergency response
- Incident response
- Planning and preparation
- Capacity planning
So, what is Site Reliability Engineering (SRE)?
The SRE team’s job is to run applications for mission-critical systems in production and to do whatever it takes to keep the site up and running. A common definition of an SRE is a software engineer performing operations work. SRE teams are in charge of setting and maintaining service level indicators (SLIs), objectives (SLOs), agreements (SLAs), and error budgets for their systems, and of ensuring that they are met. They are expected to split their time between operational work (ensuring that systems function as expected) and improving the systems they oversee. SREs specialise in writing software that automates processes and reduces manual labour; toil is defined as any manual activity on a system that isn’t yet automated.
SRE’s strategic objectives are to:
- Ease deployment
- Increase or maintain uptime
- Gain insight into application performance
- Establish SLIs, SLOs, and error budgets
- Move faster by taking calculated risks
- Eliminate toil
- Reduce the cost of failure, shortening the time it takes to develop new features
SLIs, and SLOs with consequences
SLOs (Service Level Objectives) are simply agreements between SRE teams and product owners or line-of-business (LOB) managers. The metrics involved depend heavily on the nature of the system the team is in charge of. Service Level Indicators (SLIs) are the quantitative measures defined for a system: “what we are measuring.” The indicators are determined by the system being managed. For a typical web application, these could be availability, request latency, or error rate. A Hyperledger Fabric blockchain application, for example, might use the endorsement and ledger commit rates per second to measure the network’s throughput.
SRE teams will typically manage multiple systems. By defining a standard set of SLIs across applications, the team can standardise monitoring, logging, and automation across the stack.
SLOs are the target value or range within which the system should operate; they define “how good it should be.” These are the expected operational values for the SLIs defined above. For example, a blockchain network must maintain a transaction throughput of 50 to 100 transactions per second with less than 5 seconds of end-to-end latency.
It is easy to overengineer SLIs and SLOs. It is critical to keep things simple at first; as your understanding of the system improves, you can set more stringent targets.
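To make the SLI/SLO distinction concrete, here is a minimal Python sketch of checking measured SLIs against SLO target ranges. The metric names and values are hypothetical, loosely modelled on the blockchain throughput example above.

```python
# Minimal sketch: evaluating measured SLIs against SLO target ranges.
# All metric names and numbers below are illustrative, not from a real system.

slos = {
    # SLI name: (lower bound, upper bound) of acceptable values
    "commit_throughput_tps": (50, 100),   # 50-100 transactions per second
    "end_to_end_latency_s": (0, 5),       # under 5 seconds end to end
}

measured_slis = {
    "commit_throughput_tps": 72.4,
    "end_to_end_latency_s": 3.1,
}

def slo_report(slos, measured):
    """Map each SLI to True if its measured value is within the SLO range."""
    return {
        name: low <= measured[name] <= high
        for name, (low, high) in slos.items()
    }

report = slo_report(slos, measured_slis)
print(report)  # True for every SLI that meets its objective
```

Starting with a handful of coarse SLIs like these keeps the model simple; tighter ranges can be negotiated once the system’s real behaviour is understood.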
Reducing and controlling SRE team workloads
Much of an SRE team’s workload is toil: arduous, manual work. Whether you are a software developer or an architect, you may find yourself doing tasks you don’t enjoy on a daily basis: manual, monotonous, repetitive work that invites errors. SRE teams face the same kind of tasks. This, however, is where SREs can put their development skills to good use and minimise manual processes. A good rule of thumb is for SREs to devote up to 50% of their time to improving the systems they oversee.
Error budgets
An error budget is a tool that SRE teams use to balance service reliability, and it’s calculated like this:
Availability = (good events / total events) * 100
Error budget = 100 - Availability = (failed requests / (failed requests + successful requests)) * 100
A service’s error budget is 100 percent minus its SLO. A service with a 99.99 percent SLO has a 0.01 percent error budget.
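The arithmetic above can be sketched in a few lines of Python. The request counts are made up for illustration.

```python
# Minimal sketch of the availability / error-budget arithmetic.
# The request counts below are invented for illustration only.

def availability_pct(good_events, total_events):
    """Availability = (good events / total events) * 100."""
    return good_events / total_events * 100

def error_budget_pct(slo_pct):
    """Error budget = 100 - SLO (both expressed in percent)."""
    return 100 - slo_pct

successful, failed = 999_900, 100
total = successful + failed

print(round(availability_pct(successful, total), 2))  # 99.99
print(round(error_budget_pct(99.99), 2))              # 0.01
```

With a 99.99 percent SLO, those 100 failed requests out of a million consume the entire 0.01 percent budget for the period.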
The error budget gives an SLO consequences, much as an SLA has a penalty clause: it determines how much room you have to miss your objectives. If your SLO states that 99.99 percent of transactions must be committed to the ledger within 5 seconds, only 0.01 percent of transactions may take longer. The day after a major release, you might notice the system slowing down and suddenly deplete your entire error budget. Remember that change is the most common cause of outages, and launches are major sources of change.
If you consistently exceed your error budget, you will need to revisit some of your SLOs and processes:
- Are you making too many changes in a single release? Keep releases simple and break them up into smaller chunks.
- Are the SLOs overly restrictive? You may need to negotiate and loosen them.
- Does your release process involve manual steps that are causing issues? Automate and test as much as possible.
- Is the system’s architecture fault tolerant? Hardware fails, network packets go missing, and upstream or downstream applications misbehave; your architecture should be able to withstand such failures.
- Does the development team address technical debt? It is frequently overlooked in the rush to release new features.
- Does your monitoring and alerting catch early warning signs? Growing queue sizes, network slowdowns, frequent leader changes, and so on can cause downstream incidents.
- Do you review your logs regularly and keep them clean? Logs may contain warnings that cause no immediate problems, but combined with other infrastructure issues they can lead to major incidents.
The four golden signals for distributed system monitoring
The SRE’s four golden signals are among the most important principles and best practises for building successful monitoring and alerting systems, and they are an essential component of the service level objectives for large-scale production applications. Their purpose is to help you identify and proactively address potential flaws in your system, giving your operations team near-real-time visibility into problems and the ability to track latency, traffic, errors, and saturation across all services.
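As an illustration, here is a minimal, self-contained Python sketch of tracking the four golden signals for a single service. The class, the capacity figure, and the sample requests are all hypothetical; in practice you would record and export these through a metrics system rather than compute them by hand.

```python
# Hypothetical sketch: tracking the four golden signals for one service.
import statistics

class GoldenSignals:
    def __init__(self, capacity):
        self.capacity = capacity   # max requests the service handles per window (assumed)
        self.latencies = []        # per-request latency in seconds
        self.requests = 0          # traffic
        self.errors = 0

    def record(self, latency_s, ok=True):
        self.requests += 1
        self.latencies.append(latency_s)
        if not ok:
            self.errors += 1

    def snapshot(self):
        return {
            "latency_p50_s": statistics.median(self.latencies),  # latency
            "traffic_requests": self.requests,                   # traffic
            "error_rate": self.errors / self.requests,           # errors
            "saturation": self.requests / self.capacity,         # saturation
        }

signals = GoldenSignals(capacity=1000)
signals.record(0.120)
signals.record(0.250)
signals.record(0.900, ok=False)   # a slow, failed request
print(signals.snapshot())
```

Saturation is approximated here as request count against a fixed capacity; a real service would use a resource-specific measure such as CPU, memory, or queue utilisation.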
Monitoring and Alerting
Monitoring is how you observe a system’s behaviour, and alerts are events triggered when a system is broken or about to break. SRE teams must therefore build dependable and meaningful monitoring systems.
Tools can help us build effective monitoring systems. Prometheus is an open-source event monitoring and alerting application that uses an HTTP pull model to store real-time metrics in a time series database. Prometheus can, for example, be set up to scrape metrics from Hyperledger Fabric blockchain nodes.
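As a sketch, a Prometheus scrape configuration for such a setup might look like the following. The job names, host names, and ports are hypothetical; Fabric nodes expose Prometheus metrics through their operations endpoint when metrics are enabled, so substitute your own operations listen addresses.

```yaml
# prometheus.yml - illustrative scrape configuration.
# Host names and ports are hypothetical; point the targets at your nodes'
# actual operations endpoints (with Prometheus metrics enabled).
scrape_configs:
  - job_name: "fabric-peers"
    scrape_interval: 15s
    static_configs:
      - targets: ["peer0.example.com:9443", "peer1.example.com:9443"]
  - job_name: "fabric-orderers"
    scrape_interval: 15s
    static_configs:
      - targets: ["orderer0.example.com:8443"]
```

With this in place, the pull model works as described: Prometheus periodically fetches each target’s metrics over HTTP and stores them as time series.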
Facilitate Postmortems
One important but often overlooked aspect of establishing an SRE function in your organisation is postmortem analysis. Postmortems are meant to be blameless: an opportunity for a company to learn from its mistakes. They should be performed as soon as possible after the incident has been resolved. In a complex enterprise IT environment, components and applications will eventually fail, whether due to a deployment error, a software bug introduced in a recent release, or simply malfunctioning hardware. It is critical to retain the knowledge gained by documenting the incident’s root cause along with the short- and long-term fixes, and by circulating these among the development and SRE teams. The findings from one incident can serve as a preventive measure for other systems or as a point of reference for a future incident.
This blog post covered the basic concepts and techniques needed to build a successful SRE team, and showed how to proactively identify and remediate incidents by focusing on observability via improved metrics, logs, traces, and dashboards.