Problem Management

Eliminate recurring incidents and reduce risk without turning root cause analysis into bureaucracy.

Problem Management in ITSMote exists for one reason: to stop the same incidents from happening again.

If it does not reduce incident frequency, blast radius, or recovery time over the long run, it is waste.
Problem Management is not a reporting ritual and not a postmortem factory.


Definition

A problem is the underlying cause or systemic weakness that leads to one or more incidents, or creates a credible risk of future incidents.

A problem may exist:
- after incidents have already happened
- before any incident has occurred (known risk)


Not a problem (usually)

  • One-off incidents with no realistic chance of recurrence
  • Cosmetic bugs with no operational or user impact
  • Vague technical debt with no link to incidents or risk

If it does not affect reliability, recovery, or risk, it does not belong here.


Goal

  • Prevent repeat incidents
  • Reduce blast radius when incidents happen
  • Make incident response cheaper and faster over time

Problem Management is long-term reliability work, not real-time response.


When to open a problem

Open a problem if any of the following is true:

  • The same or a similar incident has happened more than once
  • Incident impact was high (SEV1, SEV2)
  • Root cause points to a systemic weakness

Rule:

If you say "this will happen again", you need a problem ticket.
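
The three criteria can be sketched as a single check. This is a minimal illustration, not an ITSMote API: `IncidentReview`, its fields, and `HIGH_IMPACT` are hypothetical names, and treating every SEV2 as high impact is an assumption taken from the "(SEV1, SEV2)" wording above.

```python
from dataclasses import dataclass


@dataclass
class IncidentReview:
    """Hypothetical summary of an incident, filled in during review."""
    recurrence_count: int     # occurrences of the same or a similar incident, including this one
    severity: str             # e.g. "SEV1", "SEV2", "SEV3"
    systemic_weakness: bool   # root cause points to a systemic weakness


# Assumption: both SEV1 and SEV2 count as high impact.
HIGH_IMPACT = {"SEV1", "SEV2"}


def should_open_problem(review: IncidentReview) -> bool:
    """Return True if any one of the three criteria holds."""
    return (
        review.recurrence_count > 1
        or review.severity in HIGH_IMPACT
        or review.systemic_weakness
    )
```

A first-time SEV3 with no systemic cause returns `False`; any single criterion being true is enough to return `True`.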


Relationship to incidents

  • Incidents restore service
  • Problems prevent recurrence

Do not block incident closure waiting for a problem to be fixed.
An incident can be Resolved/Closed while the related problem remains Open.

Link problems to all relevant incidents.


Roles

Problem Owner (mandatory, single person)

Accountable for driving the problem to a real outcome:

  • drives root cause analysis to completion (ensures it is done and documented)
  • defines prevention strategy
  • creates and tracks follow-ups
  • decides when the problem can be closed

The owner does not have to do every task personally, but is accountable for the outcome and for assigning owners to follow-ups.

Never assign “the team” as Problem Owner.

Contributors

Engineers or stakeholders providing analysis or implementing fixes.


Lifecycle (ITSMote)

Problem Management is intentionally simple and slow-paced compared to incidents.


1) Identify

Trigger sources:
- recurring incidents
- SEV1 / high-impact SEV2
- operational risk review
- incident follow-ups / PIR outputs

Do:
- Create problem ticket
- Link related incidents
- Assign Problem Owner

Status:
- Open


2) Analyze (find the real cause)

Goal: understand why this keeps happening or could happen.

Rules:
- Facts over stories
- Systems over individuals
- “Human error” is never a root cause by itself

Typical analysis angles:
- missing guardrails
- unsafe defaults
- weak detection
- unclear ownership
- overloaded components
- fragile dependencies
- manual or undocumented procedures

Record:
- Clear root cause statement
- Evidence (logs, metrics, timelines, configs)

Status:
- Analyzing


3) Control (reduce risk now, if needed)

Optional but important.

Use if:
- The fix will take time
- Risk is high
- Another incident is likely before full resolution

Examples:
- add monitoring or alerts
- add runbook
- add temporary limits or safeguards
- document known failure modes

Status:
- Controlled


4) Fix (eliminate or reduce the cause)

Goal: implement changes that actually reduce future incidents.

Rules:
- Fewer, higher-quality actions beat long lists
- Each action must change the system, not just describe it

Examples:
- automation instead of manual steps
- validation instead of tribal knowledge
- architectural simplification
- safer deployment patterns
- better defaults

All actions must have:
- single owner
- measurable outcome
- due date

Status:
- In Progress
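
The "single owner, measurable outcome, due date" rule is easy to check mechanically before an action is accepted. A minimal sketch, assuming a hypothetical `PreventiveAction` record; the field names are illustrative and not part of any ITSMote tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PreventiveAction:
    """Hypothetical follow-up action attached to a problem."""
    title: str
    owner: Optional[str] = None                # one named person
    measurable_outcome: Optional[str] = None   # how we will know it worked
    due_date: Optional[date] = None


def action_gaps(action: PreventiveAction) -> list[str]:
    """Return the names of required fields that are still missing."""
    gaps = []
    if not (action.owner or "").strip():
        gaps.append("owner")
    if not (action.measurable_outcome or "").strip():
        gaps.append("measurable_outcome")
    if action.due_date is None:
        gaps.append("due_date")
    return gaps
```

An empty list means the action meets the minimum bar. It says nothing about whether the action actually changes the system; that still needs human judgment.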


5) Verify

Goal: confirm the problem is truly addressed.

Verify by:
- absence of repeat incidents over time
- improved metrics (errors, latency, recovery)
- successful simulations or tests

Do not close based on hope.

Status:
- Verifying
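
"Absence of repeat incidents over time" needs an explicit observation window, otherwise it turns into closing based on hope. A minimal sketch, assuming a hypothetical 90-day default window; the function name, parameters, and the window length are illustrative and should be chosen per problem.

```python
from datetime import date, timedelta


def verified_by_absence(
    fix_date: date,
    repeat_incident_dates: list[date],
    today: date,
    window_days: int = 90,  # assumption: pick a window that fits the failure's usual frequency
) -> bool:
    """True only if the full window has elapsed since the fix
    and no incident with the same root cause occurred after it."""
    if today < fix_date + timedelta(days=window_days):
        return False  # too early to tell
    return not any(d > fix_date for d in repeat_incident_dates)
```

Incidents dated on or before `fix_date` are ignored, since they predate the fix.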


6) Close

Close only when:
- Root cause is documented
- Preventive actions are completed
- Risk is demonstrably reduced or eliminated
- Related incidents are linked

Final status:
- Closed
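
The six statuses above can be read as a small state machine. A minimal sketch, assuming forward-only transitions with Control as an optional step; the text does not describe backward moves (for example, reopening after a failed verification), so none are modeled here. All names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"
    ANALYZING = "Analyzing"
    CONTROLLED = "Controlled"
    IN_PROGRESS = "In Progress"
    VERIFYING = "Verifying"
    CLOSED = "Closed"


# Assumption: forward-only flow; Control may be skipped.
ALLOWED = {
    Status.OPEN: {Status.ANALYZING},
    Status.ANALYZING: {Status.CONTROLLED, Status.IN_PROGRESS},
    Status.CONTROLLED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.VERIFYING},
    Status.VERIFYING: {Status.CLOSED},
    Status.CLOSED: set(),
}


def ready_to_close(root_cause_documented: bool, actions_completed: bool,
                   risk_reduced: bool, incidents_linked: bool) -> bool:
    """All four close criteria must hold."""
    return all([root_cause_documented, actions_completed,
                risk_reduced, incidents_linked])


def can_transition(current: Status, target: Status, *, close_ok: bool = False) -> bool:
    """Check a status change; moving to Closed also requires the close criteria."""
    if target not in ALLOWED[current]:
        return False
    if target is Status.CLOSED:
        return close_ok
    return True
```

For example, `can_transition(Status.VERIFYING, Status.CLOSED, close_ok=ready_to_close(True, True, True, True))` is allowed, while the same call with any criterion false is not.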


Known Error (optional)

Use a Known Error record if:
- Root cause is understood
- Fix is deferred or risky
- Workaround exists

Known Error must include:
- clear symptoms
- impact scope
- safe workaround
- conditions to escalate
- owner
- review date (or expiry date)

If nobody uses the workaround, the record is useless.
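
The required Known Error fields and the review date lend themselves to a simple completeness check. A minimal sketch, assuming a hypothetical `KnownError` record; the field names mirror the list above but are not an ITSMote schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class KnownError:
    """Hypothetical Known Error record."""
    symptoms: str
    impact_scope: str
    workaround: str
    escalation_conditions: str
    owner: str
    review_date: date


def missing_fields(record: KnownError) -> list[str]:
    """Names of text fields that are empty or whitespace-only."""
    return [
        name
        for name, value in vars(record).items()
        if isinstance(value, str) and not value.strip()
    ]


def is_due_for_review(record: KnownError, today: date) -> bool:
    """True once the review date has been reached."""
    return today >= record.review_date
```

Running `is_due_for_review` on a schedule is one way to keep stale records from outliving the workaround they describe.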


Metrics (minimum viable)

If Problem Management does not change these, it is not working.

Track quarterly:

  • Repeat incident rate (same root cause)
  • % of problems with completed follow-ups
  • Time from problem open to verified fix
  • SEV1 incidents caused by known problems
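
The first two metrics can be computed from very little data. A minimal sketch under stated assumptions: a "repeat" is any incident beyond the first with a given root cause label in the quarter, and a problem counts as "completed" only if it has at least one follow-up and all of them are done. Both definitions and all names are illustrative; adjust them to how your tracker records root causes.

```python
from collections import Counter


def repeat_incident_rate(root_causes: list[str]) -> float:
    """Share of incidents whose root cause had already appeared in the period.

    `root_causes` holds one root cause label per incident.
    """
    if not root_causes:
        return 0.0
    counts = Counter(root_causes)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(root_causes)


def pct_problems_with_completed_followups(problems: list[list[bool]]) -> float:
    """Percentage of problems whose follow-ups are all done.

    Each problem is a list of completion flags, one per follow-up.
    """
    if not problems:
        return 0.0
    done = sum(1 for flags in problems if flags and all(flags))
    return 100.0 * done / len(problems)
```

With root causes `["dns", "dns", "disk", "dns"]`, two of four incidents are repeats, so the rate is 0.5.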

Documentation standard

Keep it short and actionable.

Minimum problem record fields:

  • Problem summary
  • Related incidents
  • Root cause (clear and specific)
  • Risk description
  • Preventive actions (with owners and dates)
  • Verification method

Rule:

If a new engineer can’t understand the problem in 5 minutes, the doc is bad.


Minimal template

Problem header (copy/paste)

  • Title:
  • Problem Owner:
  • Linked incidents:
  • Root cause:
  • Risk:
  • Current status:
  • Preventive actions:
  • Verification plan: