Glossary
The minimum terms you need to read ITSM without bureaucracy-brain.
ITSM Glossary for Humans
This page is here to tune your brain for ITSM - so you read it as service reliability work, not paperwork.
The core frame
Service
A service is a repeatable ability to deliver value to a user. Not a server. Not a single feature.
Example: “Place an order”, “RPC access”, “Withdraw funds”.
Common mistake: treating a service as one microservice. A real service is usually end-to-end.
Product vs Service
A product is what you sell and evolve.
A service is what the user consumes and what you can make reliability promises about.
One product can contain many services.
Customer vs User
A customer pays/signs the contract.
A user actually uses the thing.
In B2B2C these are often different people.
Stakeholder
Anyone who needs to know what happened: support, engineering, sales, finance, security, key accounts, partners.
User lens (the only lens that matters)
User Journey
A user journey is the path from “I want to do X” to “X is done”, including steps and failure points.
Why it matters: incidents, SLAs, and priorities only make sense in the context of where the journey breaks.
Example journey:
Sign in → Select asset → Place order → Confirm → See status
Common mistake: describing the journey as UI screens. It must be a value flow.
Touchpoint
A touchpoint is a concrete interaction point: a button, an API call, a status page, an email, a webhook.
Critical User Journey (CUJ)
A critical user journey is a journey that:
- drives revenue, retention, or your main product value
- or is a contractual obligation
CUJs define what is actually “critical”.
Criticality and prioritization
Critical Service
A critical service is a service whose failure:
- blocks a CUJ
- or creates high risk (money, data, security, contracts)
Critical is not an engineering opinion. It’s business reality.
Dependency
A dependency is anything your service needs to work: payment provider, cloud, upstream RPC, database, auth, DNS.
Blast Radius
Blast radius is how much gets affected if a component or change fails.
Smaller blast radius means safer systems and safer changes.
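To make dependencies and blast radius concrete, here is a minimal sketch. The dependency map and every name in it are hypothetical, and it assumes every dependency is hard and has no redundancy. It walks the map backwards to list everything that breaks when one component fails.

```python
from collections import deque

# Hypothetical dependency map: each key needs every item in its list to work.
DEPENDS_ON = {
    "place-order": ["auth", "orders-db", "payment-provider"],
    "withdraw-funds": ["auth", "ledger-db"],
    "status-page": [],
    "auth": ["dns"],
    "orders-db": [],
    "ledger-db": [],
    "payment-provider": ["dns"],
    "dns": [],
}


def blast_radius(component: str) -> set[str]:
    """Return everything that breaks, directly or transitively, if `component` fails."""
    # Invert the map: for each component, which ones depend on it directly?
    dependents: dict[str, list[str]] = {name: [] for name in DEPENDS_ON}
    for name, needs in DEPENDS_ON.items():
        for dep in needs:
            dependents[dep].append(name)

    # Breadth-first walk from the failed component up through its dependents.
    affected: set[str] = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


for name in ("dns", "orders-db", "status-page"):
    print(name, "->", sorted(blast_radius(name)))
```

In this toy map a DNS failure reaches both order placement and withdrawals, while a failure of the orders database only reaches order placement. Comparing those sets is one way to decide which component or change deserves the most care.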
Single Point of Failure (SPOF)
A SPOF is one thing that can take the service down by itself.
If a critical service has a SPOF, that is a high-priority risk.
Service quality: SLI, SLO, SLA
SLI (Service Level Indicator)
An SLI is a measurement of service quality: availability, success rate, error rate, latency (p95/p99), freshness, throughput.
Example: “% of successful requests over 5 minutes”.
SLO (Service Level Objective)
An SLO is a target for an SLI:
“Success rate ≥ 99.9% over 30 days”.
An SLO is for the team’s decisions, not for marketing.
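As a rough illustration of how an SLI feeds an SLO check, the sketch below computes a success-rate SLI from request counts and compares it with the 99.9% target used in the example above. The function names, the sample counts, and the choice to treat an empty window as 100% are all assumptions.

```python
def success_rate(successful: int, total: int) -> float:
    """SLI: percentage of successful requests in a measurement window."""
    if total == 0:
        # Assumption: a window with no traffic counts as fully successful.
        return 100.0
    return 100.0 * successful / total


def meets_slo(sli_percent: float, target_percent: float = 99.9) -> bool:
    """SLO check: is the measured SLI at or above the target?"""
    return sli_percent >= target_percent


# Hypothetical counts for one window.
window_sli = success_rate(successful=999_412, total=1_000_000)
print(f"SLI = {window_sli:.3f}%, meets SLO: {meets_slo(window_sli)}")
```

The same check only becomes an SLA question once a customer promise and consequences are attached to the target.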
SLA (Service Level Agreement)
An SLA is a promise to a customer plus consequences (credits/penalties) if you miss it.
Common failure: setting an SLA without good measurement and without the ability to control outcomes.
Incidents and problems
Incident
An incident is an unplanned interruption or degradation of a service, or a credible risk of that, with user/business impact.
The key is impact, not internal inconvenience.
Event vs Alert vs Incident
- Event: any signal (log line, metric point, trace, state change)
- Alert: “this might be bad” (a threshold/rule fired)
- Incident: confirmed issue with impact
Two common mistakes:
- Treating every alert as an incident
- Ignoring real impact because “monitoring didn’t page”
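The distinction can be sketched in a few lines. This is an illustration only: the threshold, the metric name, and the `classify` helper are hypothetical. The point is that a fired rule produces an alert, while confirmed impact is what makes an incident, with or without a page.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Any raw signal: here, a single metric point."""
    metric: str
    value: float


# Hypothetical alert rule: error rate above 5%.
ERROR_RATE_THRESHOLD = 0.05


def alert_fired(event: Event) -> bool:
    """An alert is just a rule that matched: 'this might be bad'."""
    return event.metric == "error_rate" and event.value > ERROR_RATE_THRESHOLD


def classify(event: Event, impact_confirmed: bool) -> str:
    """Confirmed impact decides 'incident', whether or not a rule fired."""
    if impact_confirmed:
        return "incident"
    if alert_fired(event):
        return "alert"
    return "event"


print(classify(Event("error_rate", 0.09), impact_confirmed=False))
print(classify(Event("error_rate", 0.01), impact_confirmed=True))
```

The first call covers the first mistake (an alert alone is not an incident). The second covers the other one (users are affected even though no threshold was crossed).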
Severity (SEV)
Severity describes how bad it is right now (scope of impact).
SEV is reality-based, not “how stressed we feel”.
Impact vs Urgency vs Priority
- Impact: how much damage (users, money, risk)
- Urgency: how fast you need to act
- Priority: the execution order (impact + urgency)
Mistake: mixing severity and priority. Severity describes the impact; priority decides the order of work.
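One common way to turn “impact + urgency” into a priority is a small matrix. The sketch below is an assumption, not a standard: the three levels, the P1 to P4 labels, and the score cut-offs are made up for illustration and would be tuned per organization.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to an execution priority (hypothetical cut-offs)."""
    score = impact + urgency  # ranges from 2 (low/low) to 6 (high/high)
    if score >= 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


print(priority(Level.HIGH, Level.HIGH))
print(priority(Level.HIGH, Level.LOW))
```

Note that high impact with low urgency does not land at the top: the damage is large, but nothing requires acting on it first.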
Triage
Triage is fast sorting: what broke, how bad, who owns it, and what the next move is.
Goal: stop guessing - start controlling.
Mitigation
Mitigation is an action that reduces impact quickly (rollback, disable feature, failover, rate-limit, add capacity).
Mitigation is not the same as root-cause fix.
Workaround
A workaround is a way to bypass the problem (often user-side or ops-side) until you fix it.
Root Cause
Root cause is why the system allowed the incident to happen.
The real root cause is rarely “someone made a mistake”. It is usually missing guardrails, weak isolation, unsafe changes, or poor detection.
Postmortem / PIR
A Post-Incident Review is the learning step: what happened, why, and what concrete changes reduce recurrence.
A good PIR ends with system changes, not a long essay.
Problem
A problem is an underlying cause behind one or more incidents, or a known risk that is likely to become an incident.
Problems live longer than incidents.
Changes (where most incidents come from)
Change
A change is anything that can affect a service: deploy, config update, migration, infra change, vendor change.
Standard vs Normal vs Emergency change
- Standard: low-risk, repeatable, pre-approved
- Normal: needs risk assessment and a plan
- Emergency: urgent to restore service or reduce immediate risk
Mistake: making everything “emergency”. That usually means weak planning and weak safety rails.
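The three change types can be expressed as a simple decision order. This is a sketch under assumptions: the `Change` fields and both helper names are hypothetical, and real classification usually involves more inputs. The second helper measures how often the “emergency” label is used, which is one way to spot the mistake described above.

```python
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    pre_approved: bool          # low-risk, repeatable, already signed off
    fixes_active_impact: bool   # needed now to restore service or cut immediate risk


def change_type(change: Change) -> str:
    """Classify a change as emergency, standard, or normal."""
    if change.fixes_active_impact:
        return "emergency"
    if change.pre_approved:
        return "standard"
    return "normal"  # everything else needs a risk assessment and a plan


def emergency_share(changes: list[Change]) -> float:
    """Fraction of changes classified as emergency (0.0 for an empty list)."""
    if not changes:
        return 0.0
    emergencies = sum(1 for c in changes if change_type(c) == "emergency")
    return emergencies / len(changes)
```

A share that keeps climbing is a hint that planning and safety rails need attention, not that the team is unusually unlucky.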
Rollback
A rollback is returning to a known good state.
If rollback is slow or impossible, your delivery pipeline is unsafe.
Ops glue (what makes it work in real life)
Ownership
Ownership means one person is accountable for the outcome right now.
If “everyone owns it”, nobody owns it.
Runbook vs Playbook
- Runbook: step-by-step “do this, check that”
- Playbook: decision-making guidance “how to act and choose”
Escalation
Escalation is pulling in more attention/resources: paging seniors, raising SEV, opening a bridge, notifying stakeholders.
On-call
On-call is the defined rotation responsible for responding within targets.
Status Page
A status page is a single source of truth for service health.
It reduces support load and increases trust through transparency.
Comms Cadence
Cadence is the fixed rhythm of updates (every 15/30/60 minutes depending on SEV).
A fixed cadence reduces chaos even when the update is “no new info”.
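A cadence is easy to encode so nobody has to remember it mid-incident. In the sketch below, the mapping of SEV levels to 15, 30, and 60 minutes is an assumption, as are the names. Only the intervals themselves come from the text above.

```python
from datetime import datetime, timedelta

# Hypothetical mapping: the more severe the incident, the shorter the interval.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(minutes=60),
}


def next_update_due(last_update: datetime, sev: str) -> datetime:
    """Return when the next stakeholder update is due, even if there is no news."""
    return last_update + UPDATE_INTERVAL[sev]


last = datetime(2024, 1, 1, 12, 0)
print(next_update_due(last, "SEV1"))
```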
Response metrics (so it’s not vibes)
TTA / TTO / TTM / TTR
- TTA: time to acknowledge
- TTM: time to mitigate (impact reduced)
- TTR: time to recover (service normal)
If you don’t track these, you don’t know if you improved.
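Tracking these takes four timestamps per incident. The sketch below assumes every metric is measured from the start of impact (some teams measure from detection instead). The class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    started: datetime       # impact began (assumed reference point for all metrics)
    acknowledged: datetime  # a responder took the page
    mitigated: datetime     # impact reduced
    recovered: datetime     # service back to normal

    @property
    def tta(self) -> timedelta:
        """Time to acknowledge."""
        return self.acknowledged - self.started

    @property
    def ttm(self) -> timedelta:
        """Time to mitigate."""
        return self.mitigated - self.started

    @property
    def ttr(self) -> timedelta:
        """Time to recover."""
        return self.recovered - self.started


# Hypothetical incident.
incident = IncidentTimeline(
    started=datetime(2024, 1, 1, 10, 0),
    acknowledged=datetime(2024, 1, 1, 10, 4),
    mitigated=datetime(2024, 1, 1, 10, 20),
    recovered=datetime(2024, 1, 1, 11, 5),
)
print(incident.tta, incident.ttm, incident.ttr)
```

Recording the raw timestamps, not just the durations, lets you change the reference point later without losing history.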
MTTR
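MTTR is commonly read as “mean time to restore”, but the R is also expanded as repair, recover, or resolve, so the same number can mean different things to different people.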
ITSMote-friendly approach: split it into TTM (mitigation) and TTR (recovery).