DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Enterprise AI Trend Report: Gain insights on ethical AI, MLOps, generative AI, large language models, and much more.

2024 Cloud survey: Share your insights on microservices, containers, K8s, CI/CD, and DevOps (+ enter a $750 raffle!) for our Trend Reports.

PostgreSQL: Learn about the open-source RDBMS' advanced capabilities, core components, common commands and functions, and general DBA tasks.

AI Automation Essentials. Check out the latest Refcard on all things AI automation, including model training, data security, and more.

Related

  • Too Many Tools? Streamline Your Stack With AIOps
  • Scaling SRE Teams: The Challenges and How To Build a Successful Scaling Framework
  • The Rise of the Data Reliability Engineer
  • Human-Centered Approach to Service Reliability: Building Culture, Communication, and Collaboration

Trending

  • Organizing Knowledge With Knowledge Graphs: Industry Trends
  • Getting Started With NCache Java Edition (Using Docker)
  • Data Processing in GCP With Apache Airflow and BigQuery
  • Being a Backend Developer Today Feels Harder Than 20 Years Ago
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Monitoring and Observability
  4. 7 Tips for Building and Maintaining an SRE Team in Your Company

7 Tips for Building and Maintaining an SRE Team in Your Company

In today's "always on" world, reliability is a primary business KPI. Establish a culture of reliability by implementing these 7 simple tips to build a solid SRE team in your organization.

By 
Vishal Padghan user avatar
Vishal Padghan
·
Jan. 10, 24 · Opinion
Like (2)
Save
Tweet
Share
2.1K Views

Join the DZone community and get the full member experience.

Join For Free

Many of today’s hottest jobs didn’t exist at the turn of the millennium. Social media managers, data scientists, and growth hackers were unheard of at the time. Another relatively new job role in demand is that of a Site Reliability Engineer or SRE. The profession is quite new. It’s noted that 64% of SRE teams are less than three years old. But despite being new, the job role adds a lot of value to an organization.

SRE vs. DevOps 

Site reliability engineering is the merging of development and operations into one. Most people tend to mix up SRE and DevOps. By principle, the two intertwine, but DevOps serves as the principle and SRE the practice. 

Any company looking to implement site reliability engineering in their organization might want to start by following these seven tips to build and maintain an SRE team. 

1. Start Small and Internally 

There is a high chance that your company needs an SRE team but doesn’t need a whole department right away. Site reliability management’s role is to ensure that an online service remains reliable through alert creation, incident investigation, root cause remediation, and incident postmortem. 

The average tech-based company faces a few bugs every so often. In the past, operations and development teams would come together to fix those issues in software or service. An SRE approach merges those two into one. 

If you’re just starting to build your SRE team, you can start by putting together some people from your operations and technical department and give them the sole responsibility of maintaining a service’s reliability. 

2. Get the Right People 

In cases where you’re ready to scale, the time might come where you’ll need to get additional help for your site reliability engineering team. SRE professionals are in hot demand nowadays. There are more than 1,300 site reliability engineering jobs on Indeed. 

The key to finding the right people for your SRE team is to know what you’re looking for. Here are a few qualifications to look for in a site reliability engineer.

  • Problem-solving and troubleshooting skills: Much of the SRE team’s responsibilities have to do with addressing incidents and issues in software. Most times, these problems have to do with systems or applications that they didn’t create themselves. So the ability to quickly debug even without in-depth knowledge of a system is a must-have skill. 
  • A knack for automation: Toil can often become a big problem in many tech-based services. The right site reliability engineer will look for ways to automate away the toil, reducing manual work to a minimum so that staff only deal with high-priority items. 
  • Constant learning: As systems evolve, so will problems. So good SREs will have to keep brushing up their knowledge on systems, codes, and processes that change with time. 
  • Teamwork: Addressing incidents will rarely be a one-man-job so SREs need to work well with teams. Collaboration and communication are the skills to look out for definitely. 
  • Bird’s eye view perspective: When addressing bugs, it can be easy to get caught up with the wrong things when you’re stuck in the middle of it. That’s why good SREs will need the ability to see the bigger picture and find solutions in larger contexts. A successful site reliability engineer will find the root cause and create an overarching solution.

3. Define Your SLOs 

An SRE team will most likely succeed with service level objectives in place. Service level objectives or SLOs are the key performance metrics for a site. SLOs can vary depending on the kind of service a business offers. Generally, any user-facing serving system will have to set availability, latency, and throughput as indicators. Storage-based systems will often place more emphasis on latency, availability, and durability. 

Setting up SLOs also involves placing values that a company would like to maintain in terms of indicators. The numbers your SLOs should show are the minimum thresholds that the system should hold on to. When setting an SLO, don’t base them on current performance as this might put you in a position to meet unrealistic targets. Keep your objectives simple and avoid placing any absolutes. The fewer SLOs you have in place, the better, so only measure what indicators matter to you most. 

4. Set Holistic Systems to Handle Incident Management 

Incident management is one of the most important aspects of site reliability engineering. In a survey by Catchpoint, 49% of respondents said that they had worked on an incident in the last week or so. When handling incidents, a system needs to be in place to keep the debugging and maintenance process as smooth as possible. 

One of the most important aspects of an incident management system is keeping track of on-call responsibilities. SRE team responsibilities can get extremely exhausting without an effective means to control the flow of on-call incidents. Using the right incident management tool can help resolve incidents with more clarity and structure.

5. Accept Failure as Part of the Norm 

Most people don’t like experiencing failure, but if your company wants to maintain a healthy and productive SRE team, one of the themes that each member must get used to is accepting failure as a part of the profession. Perfection is rarely ever the case in any system, most especially when in the early development stages. 

Many SRE teams mistake setting the bar too high right away and putting up unrealistic SLO definitions and targets. The best operational practice has always been to shoot for a minimum viable product and then slowly increase the parameters once the team and company as a whole build up confidence. 

6. Perform Incident Postmortems to Learn from Failures and Mistakes 

There’s an old saying that goes this way: “Dead men tell no tales.” But that isn’t the case with system incidents. There is much to learn from incidents even after problems have been resolved. That’s why it’s a great practice to perform incident postmortems so that SRE teams can learn from their mistakes. A proper SRE approach would take into account the best practices for postmortem. 

When performing post-incident analysis, there are sets of parameters that site reliability crews must analyze. First, they should look into the cause and triggers of the failure. What caused the system to fail? Secondly, the team should pinpoint as many of the effects as they can find. What did the system failure affect? For example, a payment gateway error might have caused a discrepancy in payments made or collections, which can be a headache if left unturned for even a few days. Lastly, a successful postmortem will look into possible solutions and recommendations if a similar error might occur in the future. 

7. Maintain a Simple Incident Management System 

An SRE team structure isn’t enough to create a productive team. There also needs to be a project and incident management system in place. There are various services and different IT management software use cases available to SRE teams today. Some of the factors that team managers need to consider are ease of use, communication barriers, available integrations, and collaboration capabilities. 

Setting Your SRE Team Up for Success 

An SRE team can be likened to an aircraft maintenance crew fixing a plane while it’s 50,000 feet in the air. Setting your SRE team up for success is crucial as they will assure that your company’s service is available to your clients. While errors and bugs are inevitable in any software as a service, it can be kept to a minimum, making outages and errors a rare occasion. But for that to happen, you’ll need a solid SRE team in place, proactively finding ways to avoid errors and being ready to spring into action when duty calls.

Incident management Reliability engineering Site reliability engineering teams

Published at DZone with permission of Vishal Padghan. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Too Many Tools? Streamline Your Stack With AIOps
  • Scaling SRE Teams: The Challenges and How To Build a Successful Scaling Framework
  • The Rise of the Data Reliability Engineer
  • Human-Centered Approach to Service Reliability: Building Culture, Communication, and Collaboration

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: