How long should a postmortem take to write?

Allow 30 to 60 minutes for the initial draft, written while the incident is fresh. Reserve a separate meeting for the timeline review and action item assignment.

Who should own the postmortem?

The on-call engineer who led the response, with input from all involved. Not a manager.

What is the difference between a postmortem and a retrospective?

A postmortem is incident-specific and focuses on root cause and recurrence prevention. A retrospective is a team-level process review, broader and periodic.

Incident postmortem template

An incident postmortem (also called a post-incident review) is a structured document that captures what happened, why, and what to do differently. A blameless postmortem focuses on system factors and process gaps, not individual fault. The goal is durable learning that prevents recurrence.

Template (copy-paste ready)

Use this template immediately after an incident is resolved, while memory is fresh. Fill in every section before the review meeting so discussion time is spent on action items, not reconstruction.

Incident postmortem header

Field	Value
Date
Severity	SEV-1 / SEV-2 / SEV-3
Duration	e.g., 47 minutes
Services affected
Responders
Postmortem owner

Summary

One to three sentences describing what happened, the user-visible impact, and the fix applied. Written for someone who was not in the incident channel.

Timeline

Time (UTC)	Event
HH:MM	First alert fired
HH:MM	On-call acknowledged
HH:MM	Hypothesis formed
HH:MM	Mitigation applied
HH:MM	Service restored
HH:MM	Incident closed

Add rows as needed. Keep entries factual, one sentence each.
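Once the timeline is filled in, the Duration field in the header and a few response intervals can be derived from it rather than estimated. The following is a minimal Python sketch under stated assumptions: the timestamps are invented sample values, the metric names are illustrative, and the helper assumes every event falls within 24 hours of the first alert.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as 'HH:MM' in UTC.

    Assumes end is less than 24 hours after start, so a negative
    difference means the incident crossed midnight.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


# Hypothetical sample timeline using the template's row labels.
timeline = [
    ("14:02", "First alert fired"),
    ("14:06", "On-call acknowledged"),
    ("14:21", "Hypothesis formed"),
    ("14:38", "Mitigation applied"),
    ("14:49", "Service restored"),
    ("15:30", "Incident closed"),
]


def time_of(event: str) -> str:
    """Return the timestamp of the first timeline row with this label."""
    return next(t for t, e in timeline if e == event)


start = time_of("First alert fired")
metrics = {
    "time_to_acknowledge_min": minutes_between(start, time_of("On-call acknowledged")),
    "time_to_mitigate_min": minutes_between(start, time_of("Mitigation applied")),
    "time_to_restore_min": minutes_between(start, time_of("Service restored")),
}

for name, value in metrics.items():
    print(f"{name}: {value}")
```

With these sample values, the time to restore is what would go in the Duration field of the header.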

Root cause

Describe the technical root cause and the contributing conditions that allowed it to reach production. A useful test: if this cause had been absent, would the incident still have occurred? If not, you have the root cause. If the system would still have failed, go one level deeper.

Impact

Users affected: (number or percentage of traffic, if known)
Errors observed: (error rate, error count)
Data at risk: (yes / no, scope)
SLA breached: (yes / no, by how much)

What went well

List two to five things that worked, such as detection speed, communication, tooling that surfaced the right signal, or a runbook that was accurate. This section is not filler; it tells you what to preserve and replicate.

What could be better

List the gaps in detection, diagnosis, communication, or recovery that made this harder than it needed to be. Be specific. "Alerting was slow" is not useful. "The latency alert has a 10-minute evaluation window; a 3-minute window would have fired 7 minutes earlier" is.

Action items

Action	Owner	Due

Each action item should be specific enough to close as a pull request or a ticket. "Improve monitoring" is not an action item. "Add a p99 latency alert for the payments service with a 3-minute window and a 200ms threshold" is.
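To show how concrete that example action item is, here is a minimal Python sketch of the rule it describes: fire when p99 latency over the last 3 minutes exceeds 200ms. This is an illustration only, not the syntax of any particular monitoring system. The function names, the `(timestamp, latency)` sample format, and the nearest-rank percentile method are all assumptions.

```python
import math


def p99(latencies_ms: list[float]) -> float:
    """99th percentile using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def should_fire(
    samples: list[tuple[float, float]],
    now_s: float,
    window_s: float = 180.0,      # 3-minute evaluation window
    threshold_ms: float = 200.0,  # p99 latency threshold
) -> bool:
    """Return True if p99 latency inside the window exceeds the threshold.

    samples is a list of (timestamp_seconds, latency_ms) pairs.
    """
    recent = [lat for ts, lat in samples if now_s - window_s < ts <= now_s]
    if not recent:
        return False  # no data in the window; this sketch stays silent
    return p99(recent) > threshold_ms


# Hypothetical data: 100 requests in the last 100 seconds, two of them slow.
now = 1_000.0
samples = [(now - i, 50.0) for i in range(98)] + [(now - 98, 500.0), (now - 99, 500.0)]
print(should_fire(samples, now))
```

Every parameter in the action item maps to a named argument, which is what makes it closable as a single pull request or ticket.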

Blameless framing

The blameless postmortem principle comes from the observation that engineers make decisions based on the information available at the time. Punishing individuals for decisions that seemed reasonable under uncertainty teaches people to hide failures, not prevent them.

The practical shift: reframe every finding as a system or process question.

Instead of "the engineer deployed without running load tests," ask "what made it possible to deploy without load tests? Was the step skipped in the runbook? Was the CI gate absent? Was the engineer under time pressure that the system created?"

The answer to those questions produces an action item that fixes the system. The answer to the blame question produces an apology that fixes nothing.

A few framing checks before you publish:

Does any sentence name an individual in a negative context? Rewrite it to name the system condition instead.
Does the root cause section describe a decision someone made, or the conditions that made that decision available? Aim for conditions.
Would the engineer involved feel comfortable sharing this document publicly? If not, the framing needs work.

Blameless does not mean consequence-free. It means the postmortem itself is focused on the system. Personnel decisions, if any, happen separately.

Turning postmortems into memory

A postmortem that sits in a doc is documentation, not memory. Documentation degrades. The next on-call who faces a similar failure will not search the postmortem archive at 2am. They will grep logs, check dashboards, and guess.

Learning compounds only when two things happen. First, the action items ship. Track them in your issue tracker with a link back to the postmortem. Review open postmortem items in every sprint until they close. If an action item is still open 90 days later, either prioritize it or explicitly accept the risk.

Second, the failure pattern gets encoded where the next responder will find it. That means runbooks updated with the new diagnosis path, alert descriptions that reference the postmortem, and playbooks that capture the fix steps. The postmortem is the source; the runbook is the delivery mechanism.
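The first condition, reviewing open action items and escalating anything still open after 90 days, is easy to automate. The following is a minimal Python sketch assuming action items have been exported from the issue tracker into simple records. The `ActionItem` fields, the titles, the owners, and the example URLs are hypothetical and do not correspond to any particular tracker's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    opened: date
    postmortem_url: str  # link back to the postmortem
    closed: bool = False


def stale_items(items: list[ActionItem], today: date, max_age_days: int = 90) -> list[ActionItem]:
    """Open items that are at least max_age_days old."""
    return [
        item
        for item in items
        if not item.closed and (today - item.opened).days >= max_age_days
    ]


# Hypothetical sample data.
items = [
    ActionItem("Add p99 latency alert for payments", "alice", date(2024, 1, 10),
               "https://example.com/postmortems/1"),
    ActionItem("Add load-test gate to CI", "bob", date(2024, 3, 1),
               "https://example.com/postmortems/1"),
    ActionItem("Update failover runbook", "carol", date(2023, 11, 1),
               "https://example.com/postmortems/2", closed=True),
]

for item in stale_items(items, today=date(2024, 4, 15)):
    print(f"Prioritize or accept risk: {item.title} ({item.owner}) {item.postmortem_url}")
```

Running a check like this before each sprint review turns the 90-day rule into a prompt for an explicit decision instead of something that depends on someone remembering.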

Teams that close the loop between postmortems and runbooks build institutional memory. Teams that treat them as separate artifacts tend to repeat the same incidents, often on a 12-to-18-month cycle, each time with a fresh postmortem and the same unshipped action items.

Automating root-cause identification is one way to shrink the gap between incident and insight. NOFire AI surfaces the most probable root cause during the incident itself, which means the postmortem starts with a verified hypothesis rather than a blank page. See the AI Reliability Guide to go deeper on how AI-assisted RCA fits into a modern reliability practice.