How to run a post-incident review
A post-incident review (PIR) is a structured meeting where the team that responded to an incident reviews the timeline, confirms root cause, and assigns action items. Unlike a postmortem document (which one person drafts), the PIR meeting is a collaborative process that surfaces observations and decisions that the written record may miss. Getting both right, the document and the meeting, is what separates teams that learn from incidents from teams that repeat them.
Who should attend
Invite the on-call engineer, the service owner, a facilitator (ideally someone not directly involved in the response), and anyone who took a meaningful action during the incident. Keep attendance under eight people. Larger groups slow the meeting, dilute accountability, and make it harder to reach clear decisions on action items.
If the incident crossed team boundaries, one representative per involved team is enough. The goal is coverage of perspective, not exhaustive attendance.
How to run the meeting
A reliable five-step structure works for most incidents:
1. Set the tone. Open with an explicit statement: this is a blameless review focused on system and process factors, not individual performance. The facilitator enforces this throughout. If blame language surfaces, redirect it to the system question: what condition allowed this to happen?
2. Walk the timeline. Move chronologically, fact by fact, and hold all interpretation until the next step. Read from the incident document, alert timestamps, and chat logs. The goal is a shared, agreed-upon sequence of events. Disagreements about the timeline are significant signals; resolve them before moving forward.
3. Identify the causal chain. Once the timeline is agreed upon, work backward from the impact. Where did the system fail to prevent the issue? Where did it fail to surface it faster? Use the five-whys or a simplified fault tree. Look for contributing conditions, not just the proximate trigger.
4. Surface what went well. Explicitly identify effective practices: fast detection, a well-written runbook that shortened resolution, a team member who caught a cascading failure early. Reinforcing what worked is as important as fixing what did not.
5. Agree on action items. Every action item needs an owner, a due date, and a specific definition of done. "Improve alerting" is not an action item. "Add a latency alert at p99 > 500ms on the checkout service by June 30, owned by @dana" is. Vague action items die in backlogs.
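The action-item rule in step 5 can be checked mechanically before the meeting closes. The following is a minimal sketch, not a prescribed tool: the `ActionItem` class, its field names, and the `problems` helper are hypothetical, and the year in the example date is assumed because the text only says "June 30".

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical record for one PIR action item."""
    description: str
    owner: str                 # e.g. "@dana"; empty string means unassigned
    due: Optional[date]        # None means no due date was agreed
    definition_of_done: str    # empty string means no completion criterion


def problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item is not yet well-formed."""
    issues = []
    if not item.owner.strip():
        issues.append("missing owner")
    if item.due is None:
        issues.append("missing due date")
    if not item.definition_of_done.strip():
        issues.append("missing definition of done")
    return issues


# "Improve alerting" fails all three checks.
vague = ActionItem("Improve alerting", owner="", due=None, definition_of_done="")

# The specific version from step 5 passes (year assumed for illustration).
specific = ActionItem(
    "Add a latency alert on the checkout service",
    owner="@dana",
    due=date(2025, 6, 30),
    definition_of_done="Alert fires when p99 latency > 500ms",
)
```

A facilitator could run `problems` over every item before ending the meeting and send back any item that returns a non-empty list.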
Common failure modes of PIRs
Blame drift. Even in teams that intend to run blameless reviews, conversations slide toward individual performance. The facilitator's job is to catch this early and redirect. A useful redirect: "What would have needed to be true about the system for a different engineer to have made the same decision?"
Action item debt. Teams generate action items in every PIR but never close them. When debt accumulates, the review process loses credibility. Before closing each meeting, read back every open action item from the previous review for the same service. If items are consistently overdue, that is a capacity or prioritization problem worth naming.
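The read-back described above can be prepared from the tracker before the meeting. This is an illustrative sketch that assumes action items are exported as simple records; `TrackedItem`, `open_items`, and `overdue_ratio` are hypothetical names, not part of any particular incident tracking system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class TrackedItem:
    """Hypothetical export of one action item from the incident tracker."""
    service: str
    description: str
    owner: str
    due: date
    closed: bool = False


def open_items(items: Iterable[TrackedItem], service: str) -> List[TrackedItem]:
    """Open items for one service, oldest due date first, for the read-back."""
    matching = [i for i in items if i.service == service and not i.closed]
    return sorted(matching, key=lambda i: i.due)


def overdue_ratio(items: Iterable[TrackedItem], service: str, today: date) -> float:
    """Fraction of a service's open items whose due date has passed."""
    pending = open_items(items, service)
    if not pending:
        return 0.0
    overdue = sum(1 for i in pending if i.due < today)
    return overdue / len(pending)


# Sample data, invented for illustration.
backlog = [
    TrackedItem("checkout", "Add p99 latency alert", "@dana", date(2024, 3, 1)),
    TrackedItem("checkout", "Update failover runbook", "@lee", date(2024, 1, 10)),
    TrackedItem("checkout", "Rotate expired cert", "@dana", date(2024, 1, 5), closed=True),
    TrackedItem("search", "Tune index rebuild job", "@sam", date(2024, 1, 20)),
]
```

A ratio that stays high across several reviews is one possible signal of the capacity or prioritization problem mentioned above; the threshold that counts as "high" is a team decision.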
Survivorship bias. Many teams only review P1 incidents. P2 and batched P3 reviews surface patterns that never show up in the high-severity queue: slow degradations, repeated near-misses, and detection gaps that failed to escalate only by luck. Review every P1 and P2 individually. For P3, a weekly batch review is enough if volume is high.
Turning PIRs into durable memory
A well-run PIR is only as valuable as the memory it creates. The failure pattern, the detection gap, and the fix should be encoded somewhere that future responders can find within the first five minutes of the next similar incident.
That means the postmortem document needs to be searchable and linked from your runbooks. The action items need to land in your incident tracking system, not a meeting doc. And the pattern, the class of failure, needs to be tagged so that trend analysis is possible over time.
Teams that invest in this encoding step find that their second similar incident resolves significantly faster than the first. The institutional memory that used to live only in the heads of senior engineers becomes accessible to the whole team.
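Once each postmortem carries a failure-class tag, trend analysis can start as a simple count. The sketch below assumes incidents are available as dictionaries with a `failure_class` field; the field name, the tag values, and the `recurring_classes` helper are all hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def recurring_classes(
    records: Iterable[Dict[str, str]], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return failure classes seen at least min_count times, most frequent first."""
    counts = Counter(r["failure_class"] for r in records)
    return [(cls, n) for cls, n in counts.most_common() if n >= min_count]


# Sample records, invented for illustration.
incidents = [
    {"id": "INC-101", "service": "checkout", "failure_class": "config-drift"},
    {"id": "INC-102", "service": "search", "failure_class": "capacity"},
    {"id": "INC-103", "service": "checkout", "failure_class": "config-drift"},
    {"id": "INC-104", "service": "billing", "failure_class": "expired-cert"},
    {"id": "INC-105", "service": "search", "failure_class": "capacity"},
    {"id": "INC-106", "service": "billing", "failure_class": "config-drift"},
]
```

Keeping the tag vocabulary small and consistent matters more than the counting code, since free-form tags tend to fragment the same failure class into several labels.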
NOFire AI's Remember pillar is built around this problem: capturing what was learned during an incident and making it retrievable at the moment it is needed next. See the AI Reliability Guide to go deeper on how AI-assisted incident memory works in practice.
Review frequency
- P1: Review every incident, within 48 hours of resolution.
- P2: Review every incident, within five business days.
- P3: Batch weekly if volume is high. Flag any P3 that was close to escalating to P2 for individual review.
Consistency matters more than perfection. A lightweight PIR done reliably builds more organizational learning than an exhaustive one done intermittently.
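The review windows above can be turned into a deadline calculation. This is a sketch under stated assumptions: business days are taken to be Monday through Friday with no holiday calendar, P3 incidents return no individual deadline because they go to the weekly batch, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by the given number of weekdays (holidays are not considered)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def review_deadline(severity: str, resolved_at: datetime) -> Optional[datetime]:
    """Latest time the PIR should be held, or None for batch-reviewed incidents."""
    if severity == "P1":
        return resolved_at + timedelta(hours=48)
    if severity == "P2":
        return add_business_days(resolved_at, 5)
    if severity == "P3":
        return None  # covered by the weekly batch review
    raise ValueError(f"unknown severity: {severity!r}")


# Example: an incident resolved on Wednesday, 3 January 2024 at 10:00.
resolved = datetime(2024, 1, 3, 10, 0)
p1_due = review_deadline("P1", resolved)
p2_due = review_deadline("P2", resolved)
```

A P3 flagged as close to escalating could be passed in as "P2" so that it gets an individual deadline.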
Frequently asked questions
- How is a PIR different from a postmortem? A postmortem is the written document; a PIR is the meeting that validates and extends it. Many teams use the terms interchangeably.
- How long should a PIR meeting take? 45 to 60 minutes for most incidents. Complex multi-team incidents may need 90 minutes.
- What happens if the team disagrees on root cause? Document the disagreement and the reasoning. Assign a follow-up investigation. Do not force consensus on a technical question.