Problem Management

Eliminate recurring incidents and reduce risk without turning root cause analysis into bureaucracy.

Problem Management in ITSMote exists for one reason: to stop the same incidents from happening again.

If it does not reduce incident frequency, blast radius, or recovery time over time, it is waste.
Problem Management is not a reporting ritual and not a postmortem factory.

Definition

A problem is the underlying cause or systemic weakness that leads to one or more incidents, or creates a credible risk of future incidents.

A problem may exist:
- after incidents have already happened
- before any incident has occurred (known risk)

Not a problem (usually)

- One-off incidents with no realistic chance of recurrence
- Cosmetic bugs with no operational or user impact
- Vague technical debt with no link to incidents or risk

If it does not affect reliability, recovery, or risk, it does not belong here.

Goal

- Prevent repeat incidents
- Reduce blast radius when incidents happen
- Make incident response cheaper and faster over time

Problem Management is long-term reliability work, not real-time response.

When to open a problem

Open a problem if any of the following is true:

- The same or a similar incident has happened more than once
- Incident impact was high (SEV1, SEV2)
- Root cause points to a systemic weakness

Rule:

If you say "this will happen again", you need a problem ticket.
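
The triggers above can be expressed as a simple check. This is a minimal sketch, not an ITSMote feature: the `Incident` fields and the `should_open_problem` function are hypothetical, and grouping incidents by a shared `root_cause_key` is only one assumed way to decide that two incidents are "the same or similar".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    id: str
    severity: str            # e.g. "SEV1", "SEV2", "SEV3"
    root_cause_key: str      # assumed label used to group similar incidents
    systemic: bool = False   # reviewer judged the cause a systemic weakness


HIGH_IMPACT = {"SEV1", "SEV2"}


def should_open_problem(incident: Incident, history: list[Incident]) -> bool:
    """Return True if any of the three triggers applies."""
    repeated = any(
        past.id != incident.id and past.root_cause_key == incident.root_cause_key
        for past in history
    )
    high_impact = incident.severity in HIGH_IMPACT
    return repeated or high_impact or incident.systemic
```

Any single trigger is enough, which matches the "any of the following" wording; the check deliberately does not try to score or weigh triggers.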

Relationship to incidents

- Incidents restore service
- Problems prevent recurrence

Do not block incident closure waiting for a problem to be fixed.
An incident can be Resolved or Closed while the related problem remains Open.

Link problems to all relevant incidents.

Roles

Problem Owner (mandatory, single person)

Accountable for driving the problem to a real outcome:

- drives root cause analysis to completion (ensures it is done and documented)
- defines prevention strategy
- creates and tracks follow-ups
- decides when the problem can be closed

The owner does not have to do every task personally, but is accountable for the outcome and for assigning owners to follow-ups.

Never assign a problem to “the team”.

Contributors

Engineers or stakeholders providing analysis or implementing fixes.

Lifecycle (ITSMote)

Problem Management is intentionally simple and slow-paced compared to incidents.

1) Identify

Trigger sources:
- recurring incidents
- SEV1 / high-impact SEV2
- operational risk review
- incident follow-ups / PIR outputs

Do:
- Create problem ticket
- Link related incidents
- Assign Problem Owner

Status:
- Open

2) Analyze (find the real cause)

Goal: understand why this keeps happening or could happen.

Rules:
- Facts over stories
- Systems over individuals
- “Human error” is never a root cause by itself

Typical analysis angles:
- missing guardrails
- unsafe defaults
- weak detection
- unclear ownership
- overloaded components
- fragile dependencies
- manual or undocumented procedures

Record:
- Clear root cause statement
- Evidence (logs, metrics, timelines, configs)

Status:
- Analyzing

3) Control (reduce risk now, if needed)

Optional but important.

Use if:
- The fix will take time
- Risk is high
- Another incident is likely before full resolution

Examples:
- add monitoring or alerts
- add runbook
- add temporary limits or safeguards
- document known failure modes

Status:
- Controlled

4) Fix (eliminate or reduce the cause)

Goal: implement changes that actually reduce future incidents.

Rules:
- Fewer, higher-quality actions beat long lists
- Each action must change the system, not just describe it

Examples:
- automation instead of manual steps
- validation instead of tribal knowledge
- architectural simplification
- safer deployment patterns
- better defaults

All actions must have:
- single owner
- measurable outcome
- due date

Status:
- In Progress
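
The three requirements for an action (single owner, measurable outcome, due date) are easy to check mechanically. The sketch below is illustrative only: the `Action` class, its field names, and the `action_gaps` helper are hypothetical, and the single-owner test (no commas, not "team") is a rough assumption rather than a complete rule.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Action:
    """Hypothetical follow-up action attached to a problem."""
    title: str
    owner: Optional[str] = None               # exactly one named person
    measurable_outcome: Optional[str] = None  # e.g. "rollback takes under 5 min"
    due_date: Optional[date] = None


def action_gaps(action: Action) -> list[str]:
    """List what is missing before the action meets the three requirements."""
    gaps = []
    owner = (action.owner or "").strip()
    if not owner:
        gaps.append("owner")
    elif "," in owner or owner.lower() in {"team", "the team"}:
        gaps.append("owner must be a single person")
    if not (action.measurable_outcome or "").strip():
        gaps.append("measurable outcome")
    if action.due_date is None:
        gaps.append("due date")
    return gaps
```

An empty result means the action is ready to track; anything else is a reason to send it back to the Problem Owner.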

5) Verify

Goal: confirm the problem is truly addressed.

Verify by:
- absence of repeat incidents over time
- improved metrics (errors, latency, recovery)
- successful simulations or tests

Do not close based on hope.

Status:
- Verifying

6) Close

Close only when:
- Root cause is documented
- Preventive actions are completed
- Risk is demonstrably reduced or eliminated
- Related incidents are linked

Final status:
- Closed
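
The six statuses used across the lifecycle stages can be modeled as a small state machine. The status names come from the stages above; the transition table is an assumption made for illustration, in particular that Controlled can be skipped (the Control stage is optional) and that a failed verification returns the problem to In Progress.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"
    ANALYZING = "Analyzing"
    CONTROLLED = "Controlled"
    IN_PROGRESS = "In Progress"
    VERIFYING = "Verifying"
    CLOSED = "Closed"


# Assumed transitions: Control is optional, and a failed verification
# sends the problem back to In Progress.
ALLOWED = {
    Status.OPEN: {Status.ANALYZING},
    Status.ANALYZING: {Status.CONTROLLED, Status.IN_PROGRESS},
    Status.CONTROLLED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.VERIFYING},
    Status.VERIFYING: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

With this table, a problem cannot jump from Open straight to Closed, which is the mechanical version of "do not close based on hope".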

Known Error (optional)

Use a Known Error record if:
- Root cause is understood
- Fix is deferred or risky
- Workaround exists

Known Error must include:
- clear symptoms
- impact scope
- safe workaround
- conditions to escalate
- owner
- review date (or expiry date)

If nobody uses the workaround, the record is useless.
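
A Known Error record with the fields listed above might be sketched as follows. The `KnownError` class, its field names, and both helpers are hypothetical; the only behavior shown is flagging empty required fields and records whose review date has been reached.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class KnownError:
    """Hypothetical Known Error record with the required fields."""
    symptoms: str
    impact_scope: str
    workaround: str
    escalation_conditions: str
    owner: str
    review_date: date


TEXT_FIELDS = ("symptoms", "impact_scope", "workaround",
               "escalation_conditions", "owner")


def missing_fields(record: KnownError) -> list[str]:
    """Names of required text fields that are empty or whitespace."""
    return [name for name in TEXT_FIELDS if not getattr(record, name).strip()]


def needs_review(record: KnownError, today: date) -> bool:
    """True once the review (or expiry) date has been reached."""
    return today >= record.review_date
```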

Metrics (minimum viable)

If Problem Management does not change these, it is not working.

Track quarterly:

- Repeat incident rate (same root cause)
- % of problems with completed follow-ups
- Time from problem open to verified fix
- SEV1 incidents caused by known problems
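
Three of these metrics can be computed from very little data. The definitions below are assumptions chosen for illustration, not ITSMote's official formulas: the repeat rate counts every incident after the first with a given root cause as a repeat, and time to verified fix is reported as a median in days. All function names are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from datetime import date
from statistics import median


def repeat_incident_rate(root_causes: list[str]) -> float:
    """Share of incidents that repeat a root cause already seen in the period."""
    if not root_causes:
        return 0.0
    repeats = sum(count - 1 for count in Counter(root_causes).values())
    return repeats / len(root_causes)


def pct_followups_completed(problems: list[list[bool]]) -> float:
    """Percent of problems whose follow-ups are all done.

    Each inner list holds one done/not-done flag per follow-up;
    a problem with no follow-ups is not counted as complete.
    """
    if not problems:
        return 0.0
    complete = sum(1 for flags in problems if flags and all(flags))
    return 100.0 * complete / len(problems)


def median_days_to_verified_fix(spans: list[tuple[date, date]]) -> float:
    """Median days between problem open and verified fix.

    Raises statistics.StatisticsError if spans is empty.
    """
    return median((verified - opened).days for opened, verified in spans)
```

For example, four incidents with root causes `["dns", "dns", "disk", "dns"]` give a repeat rate of 0.5, because two of the four repeat a cause already seen.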

Documentation standard

Keep it short and actionable.

Minimum problem record fields:

- Problem summary
- Related incidents
- Root cause (clear and specific)
- Risk description
- Preventive actions (with owners and dates)
- Verification method

Rule:

If a new engineer can’t understand the problem in 5 minutes, the doc is bad.

Minimal template

Problem header (copy/paste)

Title:
Problem Owner:
Linked incidents:
Root cause:
Risk:
Current status:
Preventive actions:
Verification plan: