7 Tips for Building and Maintaining an SRE Team in Your Company

In today's "always on" world, reliability is a primary business KPI. Establish a culture of reliability by implementing these 7 simple tips to build a solid SRE team in your organization.

Vishal Padghan

Jan. 10, 24 · Opinion


Many of today’s hottest jobs didn’t exist at the turn of the millennium. Social media managers, data scientists, and growth hackers were unheard of at the time. Another relatively new role in demand is that of the Site Reliability Engineer, or SRE. The profession is young: reportedly, 64% of SRE teams are less than three years old. Despite being new, the role adds a lot of value to an organization.

SRE vs. DevOps

Site reliability engineering merges development and operations into one discipline. Many people mix up SRE and DevOps. The two intertwine, but DevOps serves as the principle and SRE as the practice.

Any company looking to implement site reliability engineering in their organization might want to start by following these seven tips to build and maintain an SRE team.

1. Start Small and Internally

There is a high chance that your company needs an SRE team but doesn’t need a whole department right away. The role of site reliability engineering is to keep an online service reliable through alert creation, incident investigation, root cause remediation, and incident postmortems.

The average tech-based company faces a few bugs every so often. In the past, operations and development teams would come together to fix those issues in software or service. An SRE approach merges those two into one.

If you’re just starting to build your SRE team, you can begin by bringing together a few people from your operations and technical departments and giving them sole responsibility for maintaining a service’s reliability.

2. Get the Right People

When you’re ready to scale, the time may come when you need additional help for your site reliability engineering team. SRE professionals are in high demand nowadays. There are more than 1,300 site reliability engineering jobs on Indeed.

The key to finding the right people for your SRE team is to know what you’re looking for. Here are a few qualifications to look for in a site reliability engineer.

Problem-solving and troubleshooting skills: Much of an SRE team’s work involves addressing incidents and issues in software. Most of the time, these problems occur in systems or applications that the team didn’t create. So the ability to debug quickly, even without in-depth knowledge of a system, is a must-have skill.
A knack for automation: Toil can become a big problem in many tech-based services. The right site reliability engineer will look for ways to automate toil away, reducing manual work to a minimum so that staff only deal with high-priority items.
Constant learning: As systems evolve, so do their problems. Good SREs keep brushing up their knowledge of the systems, code, and processes that change with time.
Teamwork: Addressing incidents is rarely a one-person job, so SREs need to work well in teams. Collaboration and communication are the skills to look for.
Bird’s eye view perspective: When addressing bugs, it is easy to get caught up in the wrong things when you’re stuck in the middle of them. That’s why good SREs need the ability to see the bigger picture and find solutions in larger contexts. A successful site reliability engineer will find the root cause and create an overarching solution.

3. Define Your SLOs

An SRE team is far more likely to succeed with service level objectives in place. Service level objectives, or SLOs, are the key performance metrics for a site. SLOs vary depending on the kind of service a business offers. Generally, any user-facing serving system will use availability, latency, and throughput as indicators. Storage-based systems often place more emphasis on latency, availability, and durability.

Setting up SLOs also involves choosing the value a company wants to maintain for each indicator. The numbers in your SLOs are the minimum thresholds that the system should hold. When setting SLOs, don’t base them on current performance, as this might commit you to unrealistic targets. Keep your objectives simple and avoid absolutes. The fewer SLOs you have in place, the better, so measure only the indicators that matter most to you.
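To make the idea of indicators and thresholds concrete, here is a minimal Python sketch of checking two indicators against SLO thresholds. The Request record, the function names, and the threshold values are all hypothetical and chosen for illustration only; a real team would read these measurements from its monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request was served successfully
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability(requests):
    """Fraction of requests that succeeded (an availability indicator)."""
    return sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency; pct is an integer from 1 to 100."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def check_slos(requests, min_availability, max_p95_ms):
    """Compare measured indicators against the (assumed) SLO thresholds."""
    return {
        "availability_met": availability(requests) >= min_availability,
        "latency_met": latency_percentile(requests, 95) <= max_p95_ms,
    }


# Hypothetical sample: 99 fast successes and one slow failure.
sample = [Request(True, 100.0)] * 99 + [Request(False, 900.0)]
print(check_slos(sample, min_availability=0.98, max_p95_ms=300))
```

Only two indicators are checked here, in keeping with the advice to measure only what matters most.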

4. Set Holistic Systems to Handle Incident Management

Incident management is one of the most important aspects of site reliability engineering. In a survey by Catchpoint, 49% of respondents said that they had worked on an incident in the last week or so. When handling incidents, a system needs to be in place to keep the debugging and maintenance process as smooth as possible.

One of the most important aspects of an incident management system is keeping track of on-call responsibilities. SRE work can become exhausting without an effective way to control the flow of on-call incidents. The right incident management tool helps resolve incidents with more clarity and structure.
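As one possible illustration of tracking on-call responsibilities, the sketch below computes who is on call on a given day under a simple round-robin rotation. The function name, the weekly shift length, and the engineer names are assumptions made for this example; dedicated incident management tools handle overrides, escalations, and time zones that this sketch ignores.

```python
from datetime import date


def on_call_for(day, engineers, rotation_start, shift_days=7):
    """Return the engineer on call on `day` in a round-robin rotation.

    `engineers` is the ordered rotation list, and `rotation_start` is the
    first day of the first engineer's shift. Both are hypothetical inputs.
    """
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


# Hypothetical three-person weekly rotation starting on January 1, 2024.
team = ["ana", "ben", "chi"]
start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), team, start))
```

Spreading shifts evenly in this way is one simple means of keeping the on-call load from falling on the same person.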

5. Accept Failure as Part of the Norm

Most people don’t like experiencing failure, but if your company wants to maintain a healthy and productive SRE team, each member must get used to accepting failure as part of the profession. No system is ever perfect, especially in the early stages of development.

Many SRE teams make the mistake of setting the bar too high right away and adopting unrealistic SLO definitions and targets. The best operational practice is to aim for a minimum viable product and then slowly raise the targets as the team and the company as a whole build up confidence.

6. Perform Incident Postmortems to Learn from Failures and Mistakes

There’s an old saying that goes this way: “Dead men tell no tales.” But that isn’t the case with system incidents. There is much to learn from incidents even after the problems have been resolved. That’s why it’s a great practice to perform incident postmortems so that SRE teams can learn from their mistakes. A proper SRE approach takes into account the best practices for postmortems.

When performing post-incident analysis, site reliability teams should examine a set of questions. First, they should look into the causes and triggers of the failure. What caused the system to fail? Second, the team should pinpoint as many of the effects as they can find. What did the system failure affect? For example, a payment gateway error might have caused a discrepancy in payments made or collected, which can be a headache if left unresolved for even a few days. Lastly, a successful postmortem will propose solutions and recommendations in case a similar error occurs in the future.
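The three questions above can be captured in a simple record so that no part of the analysis is skipped. The following Python sketch is one hypothetical way to structure it; the class name, field names, and sample entries are illustrative assumptions, not a standard postmortem template.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    title: str
    causes: list = field(default_factory=list)           # what caused the failure
    effects: list = field(default_factory=list)          # what the failure affected
    recommendations: list = field(default_factory=list)  # how to handle a recurrence

    def missing_sections(self):
        """Names of the sections that have not been filled in yet."""
        sections = ("causes", "effects", "recommendations")
        return [name for name in sections if not getattr(self, name)]


# Hypothetical draft based on the payment gateway example; entries are invented.
draft = Postmortem(
    title="Payment gateway error",
    effects=["Discrepancy between payments made and payments collected"],
)
print(draft.missing_sections())  # ['causes', 'recommendations']
```

A check like missing_sections could gate the closing of an incident until every section has at least one entry.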

7. Maintain a Simple Incident Management System

An SRE team structure alone isn’t enough to create a productive team. There also needs to be a project and incident management system in place. Various services and IT management tools are available to SRE teams today. Some of the factors that team managers need to consider are ease of use, communication barriers, available integrations, and collaboration capabilities.

Setting Your SRE Team Up for Success

An SRE team can be likened to an aircraft maintenance crew fixing a plane while it’s 50,000 feet in the air. Setting your SRE team up for success is crucial, as they ensure that your company’s service is available to your clients. While errors and bugs are inevitable in any software as a service, they can be kept to a minimum, making outages a rare occurrence. But for that to happen, you’ll need a solid SRE team in place, proactively finding ways to avoid errors and ready to spring into action when duty calls.


Published at DZone with permission of Vishal Padghan. See the original article here.

Opinions expressed by DZone contributors are their own.
