In our conversations with customers and network peers, we hear that many companies are considering setting up a dedicated SRE team or realigning existing responsibilities. According to a report from Catchpoint, 50% of organisations have dedicated SRE teams or roles, and the number of vacancies for Site Reliability Engineers has increased dramatically.
This supports the belief that system reliability, performance, and availability remain among the top drivers for establishing a stronger foundation of SRE practices.
Key drivers for an SRE practice
- The scale and complexity of IT systems are key determinants. Increasing scale and complexity undoubtedly expose the business to much more risk.
- Operational risks are not proactively mitigated through development and tend to be reactively resolved.
- The impact of operational failure on the business is substantial in terms of revenue loss and reputation.
- The frequency and severity of production incidents are high. Development teams are spending too much time firefighting. Incident management is closing incidents without fixing the underlying issues.
- Service-Level Objectives (SLOs) for high-priority systems either do not exist or are not measured. Actionable insights are not being generated and operational issues are not exposed proactively. SLOs are not actively managed.
- Production monitoring and alerting are not set up properly, which leads to poor insight into performance, availability and reliability risk. Reporting is very weak. There is little or no observability in test environments.
- Development teams miss chances to improve time to market and are not taking advantage of transformative activities such as automation frameworks, testing frameworks, automated deployment, and Infrastructure as Code. Releases often overrun and the release cycle is slow.
- Non-functional testing (performance/scalability/efficiency, resilience/recovery, security) is executed poorly if at all, and is not underpinned by testing frameworks.
- Cross-functional collaboration between Service Management, Operations, and Development teams is poor and the benefits of close cooperation are not realised.
If any of these factors describe operational challenges you are experiencing, then it might be time to examine your organisational capability and implement a remediation plan to plug key gaps.
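To make the SLO driver above more concrete, here is a minimal sketch of what "measuring" an availability SLO can look like. It assumes you can already count total and failed requests over a reporting window. The function name, the field names and the 99.9% target are illustrative assumptions, not taken from any particular tool or from the Catchpoint report.

```python
def error_budget_report(total_requests: int, failed_requests: int, slo_target: float) -> dict:
    """Summarise an availability SLO for one reporting window.

    All names here are hypothetical; slo_target is a fraction such as 0.999.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")

    # Measured availability: the share of requests that succeeded.
    availability = (total_requests - failed_requests) / total_requests

    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = total_requests * (1 - slo_target)

    # Fraction of the budget still unspent; negative once the SLO is breached.
    budget_remaining = 1 - failed_requests / allowed_failures

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
        "slo_met": availability >= slo_target,
    }


if __name__ == "__main__":
    # Example window: one million requests, 400 failures, 99.9% target.
    report = error_budget_report(1_000_000, 400, 0.999)
    print(f"availability:     {report['availability']:.4%}")
    print(f"budget remaining: {report['budget_remaining']:.0%}")
    print(f"SLO met:          {report['slo_met']}")
```

Tracking the remaining budget, rather than only a pass/fail flag, is one way to expose operational issues before the objective is actually breached.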
About the Author
Frank Warren
Frank is a Principal Consultant specialising in capacity planning, performance engineering and cloud cost optimisation. Frank leads numerous high-profile ecommerce clients, helping them achieve their business peaks while saving on cloud costs and improving performance.
It is also worth having a look at some of our recent case studies, where we have saved our clients millions of pounds in cloud spend.