SLO Bootstrap Guide

Red Hat’s SIG-SRE group has created a set of documents that share how some teams inside Red Hat engage with the practice of SLOs. This guide is a self-contained bootstrapping guide for teams looking to use the site reliability engineering approach to supporting services, and we hope these supporting documents can assist others as well. Before getting started, the following definitions establish a common understanding of the terms and phrases used throughout the package.

Steps

Step 1: Familiarise Yourself with the Definitions

Service Level Objective (SLO) - A Service Level Objective is a specific, measurable target that expresses the level of service provided to a service’s customers, be they internal or external. An objective can focus on a particular aspect of the service, such as a specific API endpoint.

Service Level Indicator (SLI) - A Service Level Indicator is a specific metric used to measure the service against the objective. These indicators may be exposed by the service itself (“white box”) or measured externally to the service (“black box”).

Service Level Agreement (SLA) - A Service Level Agreement is similar to an SLO, but is usually a legal document that imposes financial obligations on the service provider towards its customers when the agreed service performance target is not met.

Persona - Throughout these documents, the word “persona” is used to describe a kind of person who holds a specific role within a team or organisation. See also Personas Related to Managed Services.

Service - A service is a discrete piece of software that provides some value to customers (internal or external).
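To make the relationship between an SLI and an SLO concrete, the following is a minimal sketch, assuming an availability-style indicator computed as the fraction of good events over a measurement window. The class, function names, and numbers are hypothetical and are not taken from the SIG-SRE documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A measurable target for one aspect of a service (illustrative only)."""
    name: str
    target: float  # required fraction of good events, e.g. 0.999


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were good during the window."""
    if total_events == 0:
        # Assumption: with no traffic, nothing failed, so report 1.0.
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo: SLO) -> bool:
    """Compare the measured indicator against the objective."""
    return sli >= slo.target


# Hypothetical numbers for a single API endpoint.
login_slo = SLO(name="login-endpoint-availability", target=0.999)
sli = availability_sli(good_events=99_950, total_events=100_000)
print(f"SLI={sli:.4f}, meets SLO: {meets_slo(sli, login_slo)}")
```

Whether the event counts come from the service’s own metrics (“white box”) or from an external probe (“black box”) does not change the comparison; only the source of `good_events` and `total_events` differs.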

Step 2: Personas

The first document to familiarise yourself with, aside from this one, is Personas Related to Managed Services. Many of the other documents refer to the personas it describes.

Step 3: SLO Lifecycle

After reviewing the personas, read the SLO Lifecycle document, which describes the three phases of living with SLOs: Research, for SLO ideation and gap analysis; Implementation, where the service’s codebase and the service team’s processes change to support the SLO; and Iteration, for any changes to the SLO or processes after implementation. It is important to know that SLOs are living things and can (and should) be adapted to the team’s lived experiences.

Step 3.1: Picking Good SLOs and SLIs

Two documents relate directly to the SLO Lifecycle’s Research phase: Picking Good SLIs and Picking Good SLOs. A third document, which describes Error Budget Policy, supports the Iteration phase.
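For readers new to error budgets, the arithmetic can be sketched as follows. This assumes the common convention that the error budget is the share of events an SLO allows to fail, that is, one minus the SLO target. The function names and figures are hypothetical, and the Error Budget Policy document remains the reference for how a team should act on the result.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_events)
    return 1.0 - bad_events / budget


# Hypothetical window: a 99.9% target over one million requests
# tolerates roughly 1,000 failures.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, bad_events=250)
print(f"allowed failures: {allowed:.0f}, budget remaining: {remaining:.0%}")
```

A negative result from `budget_remaining` means the budget is overspent, which is typically the point at which an error budget policy comes into play.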

Step 4: RACI Chart

The SLO Lifecycle document suggests a number of responsibilities for various personas. The SLO RACI Chart aims to be a quick reference for who is responsible for what, using the responsibility assignment matrix pattern (Responsible, Accountable, Consulted, Informed).
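As an illustration of the responsibility assignment matrix pattern, the sketch below models a chart as a mapping from activity to persona to role. The activities and persona names are hypothetical placeholders rather than the contents of the actual SLO RACI Chart, and the check for exactly one Accountable persona per activity reflects a common RACI convention, not a rule stated in these documents.

```python
from enum import Enum


class Role(Enum):
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


# Hypothetical activities and personas, for illustration only.
raci_chart = {
    "Research: propose SLIs": {
        "Service Owner": Role.ACCOUNTABLE,
        "SRE": Role.RESPONSIBLE,
        "Product Manager": Role.CONSULTED,
    },
    "Implementation: instrument the service": {
        "Service Owner": Role.ACCOUNTABLE,
        "Developer": Role.RESPONSIBLE,
        "SRE": Role.CONSULTED,
    },
    "Iteration: review SLO targets": {
        "Service Owner": Role.ACCOUNTABLE,
        "SRE": Role.RESPONSIBLE,
        "Product Manager": Role.INFORMED,
    },
}


def personas_with(chart, activity, role):
    """Return the personas holding the given role for an activity."""
    return sorted(p for p, r in chart[activity].items() if r is role)


def activities_without_single_accountable(chart):
    """List activities that do not have exactly one Accountable persona."""
    return [
        activity
        for activity, assignments in chart.items()
        if sum(r is Role.ACCOUNTABLE for r in assignments.values()) != 1
    ]


print(personas_with(raci_chart, "Research: propose SLIs", Role.RESPONSIBLE))
print(activities_without_single_accountable(raci_chart))
```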

Step 5: SLO Reporting

Finally, the SLO Reporting document can help with reporting SLO hits and misses to key external stakeholders.