Quick Recap: Most banks have incident response for infrastructure but not for AI failures. A model making biased decisions at scale needs a different playbook than a database outage. Here's how to build one that actually works when regulators are watching.
The Conversation Nobody Wants to Have
Your credit risk model has been running perfectly for 18 months. Then one Tuesday, confidence drops 25% overnight. Not a catastrophe—the systems still work, APIs respond fine, no bugs in the logs. But the model is less sure about its decisions.
Your ML lead's first question: "Should we halt it?"
Your Risk Manager's first question: "Have we notified the regulator?"
Your Compliance Officer's first question: "Is this a fairness issue? Did something change in how we're approving by demographic?"
Your General Counsel's first question: "What do we tell customers who were denied based on this model?"
Notice the problem? Four different stakeholders, four different priorities, no shared playbook.
Here's what most banks do: Emergency Slack call. Someone says "Let's investigate." Someone else says "Maybe we should pause approvals?" Compliance officer is Googling whether this requires regulatory notification. Thirty minutes later, you've made zero decisions but everyone's stressed.
Banks that have figured this out have one thing: A pre-built escalation matrix that removes the guesswork. When alert fires → Classification happens → Action is predetermined. No judgment calls at 10 AM on a Tuesday.
Why AI Incidents Demand Different Responses
Infrastructure incidents are binary. Your database is down or it's up. Your API is responding or it's not. You fix it, verify it works, restore service. Clear cause, clear effect, clear resolution.
AI incidents live in the gray zone.
The model might be running perfectly fine from a systems perspective. Database is healthy. API responds in 45ms. No errors in logs. But the model is making systematically worse decisions. Or it's treating different demographic groups unfairly. Or it's so unconfident that it's basically useless. None of these are system failures. All of them are business failures.
Why this matters in BFSI: Fed guidance (2025) requires banks to identify material AI issues within 24 hours. But here's the kicker—infrastructure incidents get discovered because users start complaining. AI incidents can run silently for weeks. A credit model approving loans it shouldn't? Nobody notices until defaults spike six months later. A fraud model getting less accurate? The fraud team thinks market conditions changed. An embeddings model drifting? Your search results just get worse, gradually.
The other difference: Infrastructure incidents have one audience (ops teams). AI incidents have multiple audiences simultaneously:
Compliance (fairness implications?)
Risk (financial exposure?)
Legal (customer liability?)
Regulators (notification required?)
Business (revenue impact?)
All asking different questions. All needing different answers. All on different timelines.
The Three-Layer Detection System
You can't respond to incidents you don't see. That means automated detection with clear severity signals.
Layer 1: What Gets Monitored
For each production model:
Performance metrics: Accuracy, precision, recall, AUC-ROC (is the model still discriminating correctly?)
Confidence metrics: Average prediction confidence (is the model sure about decisions?)
Distribution shifts: Input feature distributions (are we getting different applicants than in training?)
Fairness metrics: Approval rates by demographic, accuracy by demographic (is something becoming unfair?)
Output drift: Prediction distributions (suddenly approving way more? Or way less?)
Latency: Inference speed (is the model slowing down?)
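The drift score above isn't tied to a specific formula in this playbook; one common choice is the Population Stability Index (PSI). Here's a minimal sketch, assuming numeric features and a saved sample from training time—the bin count is illustrative, and the 0.05/0.10 cut-offs need per-feature calibration:

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of one numeric feature: production sample
    vs. the training-time baseline. Higher = more drift."""
    # Drop missing values; null-rate spikes are monitored as a separate metric.
    baseline = baseline[~np.isnan(baseline)]
    current = current[~np.isnan(current)]

    # Bin edges come from baseline quantiles so every bin starts well populated.
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, bins + 1))
    # Clip production values into the training range so nothing falls outside the bins.
    current = np.clip(current, edges[0], edges[-1])

    base_counts, _ = np.histogram(baseline, bins=edges)
    curr_counts, _ = np.histogram(current, bins=edges)

    # Proportions with a small floor to avoid log(0) and division by zero.
    base_pct = np.clip(base_counts / base_counts.sum(), 1e-6, None)
    curr_pct = np.clip(curr_counts / curr_counts.sum(), 1e-6, None)

    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))
```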
Layer 2: Alert Thresholds
Not all changes are equal. You need severity levels:
GREEN (Normal):
- Accuracy: 88-92%
- Confidence: 75-85%
- Fairness disparity: <2%
- Drift score: <0.05 (minimal change)
YELLOW (Watch):
- Accuracy: 80-88% or 92-96%
- Confidence: 70-75% or 85-90%
- Fairness disparity: 2-5%
- Drift score: 0.05-0.10
RED (Act):
- Accuracy: <80%
- Confidence: <70%
- Fairness disparity: >5%
- Drift score: >0.10

Why this matters: You can't alert on every 1% change. You need signal-vs-noise separation. Green is normal statistical variation. Yellow is "pay attention, something's shifting." Red is "stop everything and investigate."
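To make the bands concrete, here's a minimal sketch of a severity classifier that applies the illustrative thresholds above, with the worst individual metric deciding the overall color. Your real bands will differ per model:

```python
def severity(accuracy: float, confidence: float,
             fairness_disparity: float, drift_score: float) -> str:
    """Classify an alert as GREEN / YELLOW / RED from the four runbook metrics.
    Thresholds mirror the illustrative bands above; tune them per model."""
    levels = []

    # Accuracy: 88-92% normal, 80-88% or 92-96% watch, below 80% act.
    # (Unexpectedly high accuracy is also treated as "watch" here.)
    if accuracy < 0.80:
        levels.append("RED")
    elif 0.88 <= accuracy <= 0.92:
        levels.append("GREEN")
    else:
        levels.append("YELLOW")

    # Confidence: 75-85% normal, 70-75% or 85-90% watch, below 70% act.
    if confidence < 0.70:
        levels.append("RED")
    elif 0.75 <= confidence <= 0.85:
        levels.append("GREEN")
    else:
        levels.append("YELLOW")

    # Fairness disparity: <2% normal, 2-5% watch, >5% act.
    if fairness_disparity > 0.05:
        levels.append("RED")
    elif fairness_disparity >= 0.02:
        levels.append("YELLOW")
    else:
        levels.append("GREEN")

    # Drift score: <0.05 normal, 0.05-0.10 watch, >0.10 act.
    if drift_score > 0.10:
        levels.append("RED")
    elif drift_score >= 0.05:
        levels.append("YELLOW")
    else:
        levels.append("GREEN")

    # The worst signal wins: one Red metric makes the whole alert Red.
    for worst in ("RED", "YELLOW", "GREEN"):
        if worst in levels:
            return worst
```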

Layer 3: Classification Framework
When an alert fires, you don't immediately panic. You classify the incident on three axes to determine escalation speed.
Axis 1: Autonomy Level (How much does the model decide?)
Level 1: Decision support only. Humans review every decision.
Level 2: Recommended decision. Humans typically approve, but model drives the recommendation.
Level 3: Autonomous decision. Model decides, humans see the outcome logged after the fact.
Level 4: Autonomous + irreversible. Model decides, decision executes immediately, hard to reverse.
Axis 2: Impact Scope (How many decisions/customers?)
Scope A: <1% of daily decisions
Scope B: 1-10% of daily decisions
Scope C: 10-50% of daily decisions
Scope D: >50% of daily decisions
Axis 3: Decision Criticality (How important is this decision?)
Critical: Credit approvals, capital allocations, sanctions screening
High: Fraud alerts, transaction monitoring, anti-money laundering holds
Medium: Customer service routing, product recommendations
Low: UI personalization, content ranking
The combination determines your action:
Level 4 + Scope D + Critical = STOP IMMEDIATELY
Level 4 + Scope C + Critical = HALT WITHIN 1 HOUR
Level 3 + Any + Critical = NOTIFY RISK & LEGAL (within 2 hours)
Level 2 + Any + Any = FLAG & MONITOR (next business day investigation)
Level 1 + Any + Any = LOG & INVESTIGATE (standard priority)
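As a sketch of how those rules stay deterministic under pressure, the combinations can be encoded as a first-match-wins lookup. The catch-all default and the handling of combinations not listed above are assumptions; in practice the matrix should be driven from the model registry, not hard-coded:

```python
def immediate_action(autonomy: int, scope: str, criticality: str) -> str:
    """Map (autonomy level 1-4, scope "A"-"D", criticality) to the predetermined action.
    First matching rule wins; unlisted combinations fall through to the default."""
    if autonomy == 4 and scope == "D" and criticality == "Critical":
        return "STOP IMMEDIATELY"
    if autonomy == 4 and scope == "C" and criticality == "Critical":
        return "HALT WITHIN 1 HOUR"
    if autonomy >= 3 and criticality == "Critical":
        return "NOTIFY RISK & LEGAL (within 2 hours)"
    if autonomy == 2:
        return "FLAG & MONITOR (next business day investigation)"
    return "LOG & INVESTIGATE (standard priority)"

# Example: a fully autonomous credit model affecting 10-50% of daily decisions
print(immediate_action(4, "C", "Critical"))  # -> HALT WITHIN 1 HOUR
```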
The Escalation Matrix (The Actual Decision Rules)
Here's where the magic happens. This is the table you print, laminate, and put in every incident war room.
| Detection Level | Autonomy | Scope | Criticality | Primary Action | Who Gets Called | Timeline | Key Documentation |
|---|---|---|---|---|---|---|---|
| Red | Level 4 | D | Critical | HALT model immediately. Switch to manual decisions or previous model version. | IC, ML Lead, Risk VP, Compliance | 0-15 min | Incident ticket, action taken, who approved halt |
| Red | Level 4 | C | Critical | PAUSE autonomous decisions. Audit recent predictions before resuming. | ML Lead, Risk Manager | Within 1 hour | Root cause hypothesis, audit results |
| Red | Level 4 | B | Critical | PAUSE new decisions. Investigate root cause same business day. | ML Lead, Risk | Within 4 hours | Investigation summary, remediation proposal |
| Red | Level 3+ | Any | High | NOTIFY Risk & Compliance. Prepare for customer communication if decisions are reversible. | Risk Committee, Compliance, Legal | Within 2 hours | Incident report, customer communication draft |
| Yellow | Level 4 | Any | Critical | INVESTIGATE same shift. Don't halt unless it escalates to Red. | ML Lead, Data Engineer | Within 8 hours | Technical investigation memo |
| Yellow | Level 3 | Any | Any | MONITOR closely. Schedule proper investigation next business day. | ML Lead | Next business day | Investigation ticket, priority flagged |
| Yellow | Level 1-2 | Any | Any | LOG incident. Standard priority investigation. | ML Team | Standard | Incident log entry |

The Runbook: What People Actually Do
When the escalation matrix says "Call Risk VP," what happens next? Here's the step-by-step that actually works:
For Incident Commander (First Responder)
When paged (work through these steps in order; the ML Lead's investigation runs in parallel):
Verify the alert is real (2 minutes)
Log into monitoring dashboard
Confirm the metric change yourself (not a dashboard bug)
Note the exact time the alert fired vs. when the issue started
Document in Slack: "Alert verified at 10:47 AM. Confidence dropped from 88% to 63% starting ~10:30 AM"
Gather the facts (3 minutes)
What model? (name, version hash, environment)
What changed? (confidence? fairness? accuracy?)
How long has it been running this way?
How many decisions affected? (Use Scope A-D)
Is the model still running or paused?
Document in template:
MODEL: [Name] | VERSION: [Git hash or version number]
METRIC: [Confidence/Fairness/Accuracy] changed by [amount]%
SINCE: [Time]
SCOPE: [A/B/C/D - est. number of affected decisions]
STATUS: [Still running / Paused / Halted]
Classify using the matrix (2 minutes)
Determine autonomy level (is this a Level 3 or Level 4 model?)
Determine scope (A/B/C/D)
Determine criticality (Critical/High/Medium/Low)
Look up the row in the escalation matrix
Note the timeline and who to call
Execute the immediate action (1 minute)
Red + Critical: Don't ask permission. Halt the model now.
Red + High: Notify stakeholders while investigating.
Yellow: Notify but don't halt.
Green: Log it.
Notify stakeholders (2 minutes)
Post in #ai-incident-response Slack channel with this format:
🚨 AI INCIDENT: [Model Name]
Severity: [Red/Yellow/Green] | Level [1-4] | Scope [A-D] | [Criticality]
Detection: [What changed - confidence/fairness/accuracy]
Action: [Halted/Paused/Investigating/Logged]
Who's involved: [IC, ML Lead, Risk Manager, Compliance]
Next update in: 15 minutes
Success metric: Situation report ready, stakeholders notified, and a decision made within 15 minutes of the alert.
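If you want the notification itself scripted, here's a minimal sketch that posts the template to a Slack incoming webhook using only the standard library; the webhook URL is a placeholder, and your bank's approved integration (or PagerDuty) may differ:

```python
import json
import urllib.request

# Placeholder: real incoming-webhook URLs are issued per Slack workspace/channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def post_incident(model: str, severity: str, level: int, scope: str,
                  criticality: str, detection: str, action: str, involved: str) -> None:
    """Post the incident template to #ai-incident-response via an incoming webhook."""
    text = (
        f":rotating_light: AI INCIDENT: {model}\n"
        f"Severity: {severity} | Level {level} | Scope {scope} | {criticality}\n"
        f"Detection: {detection}\n"
        f"Action: {action}\n"
        f"Who's involved: {involved}\n"
        f"Next update in: 15 minutes"
    )
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()  # Slack replies "ok" on success
```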
For ML Lead (Technical Investigator)
When called on the incident (parallel with IC's actions):
Get situation context from IC (2 minutes)
Alert type, severity, scope, action taken
Ask: "Is this halted already?"
Set up war room (Zoom or Slack thread for coordination)
Investigate the three layers immediately (15-45 minutes):
Data layer (Did input change?):
Query: Feature statistics from last 24 hours vs. baseline
Look for: Nulls spike, outliers, distribution shifts
Talk to data ops: "Did anything change upstream?"
Example: "Feature X is now null 40% of the time" = data quality issue
Model layer (Did the model change?):
Check: Recent deployments/retrains? Model code changes?
Check: Feature calculation pipeline working?
Look for bugs that shipped yesterday or changes that went live
Example: "Model v2.1 deployed 6 hours ago, confidence dropped 2 hours later" = timing correlates
Output layer (What's actually happening?):
Sample recent predictions (10-20 examples)
Compare to baseline from 1 week ago
Check decision thresholds/configs
Example: "Threshold was at 65%, someone changed it to 55%" = configuration issue
Form a hypothesis (by 30-minute mark):
Write one sentence: "The model itself is fine. Feature X went null due to upstream system change."
Or: "Model v2.1 has a bug in feature scaling."
Or: "Market conditions shifted. Model is correctly less confident."
Propose remediation (by 45-minute mark):
If data issue: "Filter null values" or "Revert upstream change"
If model issue: "Rollback to v2.0" or "Hotfix feature scaling"
If expected behavior: "Raise confidence threshold? Accept new distribution?"
Update status (every 30 minutes):
Post in war room: "Investigating [data/model/output]. Hypothesis: [one sentence]. ETA: [time]"
Success metric: A root-cause hypothesis and a proposed fix within 1 hour, with enough evidence to be confident the fix is right.
For Risk Manager (Business Impact & Governance)
When called on critical incidents:
Assess business impact (15 minutes):
How many customers are affected? (Translate Scope A-D into actual customer counts for your decision volumes.)
What type of decisions? (Credit denials carry different exposure than fraud holds.)
Are decisions reversible? (Can we call back denials? Can we refund charges?)
Worst-case financial impact?
Determine regulatory obligation (within 30 minutes):
Is this material? (Fed considers material = potential regulatory concern)
Does it require 24-hour notification? (Only if material + autonomous decision)
Do we need customer communication? (If decisions were unfair/discriminatory)
Decision: Notify regulator now? Monitor and update within 24 hours? No notification needed?
Approve or reject remediation (within 1-2 hours):
ML Lead proposes a fix
Risk Manager evaluates: Is this fix acceptable?
Question: Does it restore fairness? Does it fix the accuracy problem?
Options: Approve → Resume model. Reject → Rollback or escalate further.
For critical incidents, you might require third-party validation before the model resumes.
Document decisions (within same day):
Create incident record (date, model, issue, root cause, action taken)
Update model risk register
Schedule post-mortem meeting

Common Scenarios & Exact Responses
Scenario 1: Confidence Drops 30%
Alert fires: Red | Level 4 | Scope B | Critical
IC action: Pause new approvals immediately (5 min)
ML investigation: Check for data quality issues or recent changes (30 min)
Hypothesis: Feature X null rate jumped from 5% to 35%
Root cause: Upstream system schema change (missing field)
Remediation: Apply a data filter (exclude records with null Feature X), then resume; see the sketch below
Timeline: Full resolution within 1 hour
Risk assessment: No regulatory notification (a data quality fix is operational, not material)
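A minimal sketch of that remediation, assuming scoring batches arrive as a pandas DataFrame and `feature_x` is the broken column (both are placeholders): records missing the feature are routed to manual review and everything else resumes autonomous scoring.

```python
import pandas as pd

def split_for_scoring(batch: pd.DataFrame, feature: str = "feature_x"):
    """Temporary remediation for the null-spike scenario: score only records
    where the broken feature is present; send the rest to manual review."""
    manual_review = batch[batch[feature].isna()]
    auto_score = batch[batch[feature].notna()]
    return auto_score, manual_review
```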
Scenario 2: Fairness Disparity Jumps from 2% to 8%
Alert fires: Red | Level 4 | Scope C | Critical
IC action: Pause new approvals immediately (5 min)
ML investigation: Run a demographic breakdown analysis (30 min); see the sketch below
Discovery: Approval rate for female applicants dropped from 50% to 40%; the male approval rate stayed at 50%
Root cause: Yesterday's retraining used data with more defaults in the female segment (legitimate risk signal or bias?)
Risk assessment: Potential discrimination issue. Notify Legal immediately.
Regulatory angle: Requires 24-hour notification to the Fed if not resolved.
Decision: Revert to model v2.0 (known to be fair) OR retrain with fairness constraints
Timeline: Critical resolution; legal approval required before resuming
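Here's a sketch of the demographic breakdown the ML Lead would run, assuming decisions land in a DataFrame with a boolean `approved` column and a demographic column retained for fairness monitoring (column names are placeholders):

```python
import pandas as pd

def approval_disparity(decisions: pd.DataFrame,
                       group_col: str = "gender",
                       approved_col: str = "approved") -> float:
    """Approval rate per demographic group and the largest pairwise gap.
    A gap above 5% is Red per the thresholds earlier in this piece."""
    rates = decisions.groupby(group_col)[approved_col].mean()
    gap = float(rates.max() - rates.min())
    print(rates.to_string())
    print(f"Max approval-rate gap: {gap:.1%}")
    return gap
```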
Scenario 3: Accuracy Drops Slowly (Caught in Monitoring)
Alert fires: Yellow | Level 3 | Scope B | Medium
IC action: Notify the ML Lead and flag for investigation (no immediate halt required)
ML investigation: Compare model accuracy on recent data vs. training data; see the sketch below
Discovery: Accuracy on data from the past 30 days is 89%, versus 94% at training time
Root cause: Market conditions shifted (different applicant profile, not a model bug)
Decision: This is expected. But do we keep running at 89% or retrain?
Timeline: Investigation by end of day, decision by next morning
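A sketch of that accuracy comparison, assuming outcome labels for recently matured decisions are available (for credit models these lag by weeks, which is exactly why this scenario surfaces as a slow Yellow rather than a sudden Red); the 94% training figure is the illustrative value from the scenario:

```python
from sklearn.metrics import accuracy_score

def accuracy_vs_training(y_true_recent, y_pred_recent,
                         training_accuracy: float = 0.94) -> float:
    """Realised accuracy on recent, outcome-labelled decisions vs. the
    accuracy reported at training time."""
    recent = accuracy_score(y_true_recent, y_pred_recent)
    print(f"Recent accuracy: {recent:.1%} | Training accuracy: {training_accuracy:.1%}")
    return recent
```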
Building the Runbook: Implementation Checklist
Step 1: Document Your Models (Weeks 1-2)
For each production model, you need:
Model name, version, what it does
Autonomy level (1-4)
Impact scope (A-D)
Criticality level (Critical/High/Medium/Low)
Owner (who owns it day-to-day?)
Key metrics to monitor
Alert thresholds (Green/Yellow/Red)
Backup/fallback option (what happens if model fails?)
This table goes into a model registry that stays accessible during an incident.
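As a minimal sketch, one registry entry could look like the structured record below; field names and the example values are assumptions, and most banks will keep this in a model inventory or model risk system rather than in code:

```python
from dataclasses import dataclass, field

@dataclass
class ModelRegistryEntry:
    """One row of the registry the Incident Commander looks up during an incident."""
    name: str
    version: str
    description: str
    autonomy_level: int                     # 1-4
    impact_scope: str                       # "A"-"D"
    criticality: str                        # Critical / High / Medium / Low
    owner: str                              # day-to-day owner (team or person)
    key_metrics: list = field(default_factory=list)
    thresholds: dict = field(default_factory=dict)   # Green/Yellow/Red bands per metric
    fallback: str = ""                      # what happens if the model is halted

# Illustrative entry
credit_model = ModelRegistryEntry(
    name="credit-risk-scorer",
    version="2.1.0",
    description="Retail credit approval scoring",
    autonomy_level=4,
    impact_scope="C",
    criticality="Critical",
    owner="Retail Credit ML team",
    key_metrics=["accuracy", "confidence", "fairness_disparity", "drift_score"],
    thresholds={"confidence": {"green": (0.75, 0.85), "yellow": (0.70, 0.90)}},
    fallback="manual underwriting queue",
)
```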
Step 2: Build Automated Detection (Weeks 3-6)
Set up monitoring (Prometheus/Grafana for metrics)
Configure alerts (PagerDuty, email, Slack integration)
Test alerts with dry runs
Ensure stakeholders can access dashboards in real-time
Establish alert delivery (who gets paged? Slack? Email? Phone?)
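Since the step above mentions Prometheus/Grafana, here's a minimal sketch of exposing the runbook metrics as Prometheus gauges with the `prometheus_client` library; metric names, the port, and the example values are placeholders, and the alert thresholds themselves would live in Prometheus or Grafana rules:

```python
import time
from prometheus_client import Gauge, start_http_server

# One gauge per runbook metric, labelled by model, so dashboards and alert
# rules can slice per model. Metric names are illustrative.
CONFIDENCE = Gauge("model_avg_confidence", "Average prediction confidence", ["model"])
FAIRNESS_GAP = Gauge("model_fairness_disparity", "Max approval-rate gap across groups", ["model"])
DRIFT = Gauge("model_drift_score", "Input drift score vs. training baseline", ["model"])

def publish_metrics(model: str, confidence: float, fairness_gap: float, drift: float) -> None:
    """Record the latest batch metrics; Prometheus scrapes them from this process."""
    CONFIDENCE.labels(model=model).set(confidence)
    FAIRNESS_GAP.labels(model=model).set(fairness_gap)
    DRIFT.labels(model=model).set(drift)

if __name__ == "__main__":
    start_http_server(9108)  # scrape endpoint; port is a placeholder
    publish_metrics("credit-risk-scorer", confidence=0.82, fairness_gap=0.015, drift=0.03)
    while True:              # in production this lives inside the scoring service
        time.sleep(30)
```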
Step 3: Create Incident Response Infrastructure (Weeks 7-8)
Slack channel: #ai-incident-response (monitored 24/7)
Runbook in shared wiki (not someone's Notion page)
On-call rotation with clear handoffs
Escalation contact list (who to call if things get worse?)
War room setup (Zoom, Slack, docs all ready to go)
Step 4: Run War Games (Week 9)
Simulate incidents quarterly
Test escalation matrix in real scenario
Find gaps ("Oh, we never defined who Risk Manager is")
Update runbook based on learnings
Train new team members by doing a simulation
Step 5: Train & Document (Week 10)
All on-call team members get trained (2-hour session)
Walk through each role's responsibilities
Practice the decision-making under time pressure
Create quick-reference cards for each role
Share post-mortem learnings with team
Runbook Checklist: From Alert to Resolution
A single-page action checklist for the Incident Commander, formatted to print and laminate for the war room.
Phases (top to bottom):
Phase 1: Detection (0-5 min)
☐ Alert received, verify it's real
☐ Note time alert fired and time issue started
☐ Get model name, version, what changed
☐ Estimate scope (A/B/C/D) and autonomy level
Phase 2: Classification (5-10 min)
☐ Determine autonomy level (1/2/3/4) — look up in model registry
☐ Determine criticality (Critical/High/Medium/Low) — look up in model registry
☐ Cross-reference escalation matrix for action
☐ Execute action (Halt/Pause/Investigate/Log)
Phase 3: Notification (10-15 min)
☐ Post in #ai-incident-response with template format
☐ Call team members based on escalation matrix
☐ Assign investigation to ML Lead
☐ Assign impact assessment to Risk Manager
Phase 4: Investigation (15-60 min)
☐ ML: Check data quality (nulls, outliers, shifts)
☐ ML: Check model (deployments, code changes, configs)
☐ ML: Check outputs (actual decisions, thresholds)
☐ Form hypothesis with root cause
☐ Propose remediation
Phase 5: Decision (60-120 min)
☐ Risk Manager: Assess business impact
☐ Risk Manager: Determine regulatory obligation
☐ Risk Manager: Approve or reject remediation
☐ Execute remediation or escalate
Phase 6: Closure (same day)
☐ Document incident (date, model, root cause, action)
☐ Update model risk register
☐ Schedule post-mortem meeting
☐ Notify stakeholders of resolution
Contact card (bottom of sheet):
Incident Commander on-call: [Phone/Slack]
ML Lead on-call: [Phone/Slack]
Risk Manager on-call: [Phone/Slack]
Escalation if all unreachable: [CFO/COO number]
Looking Ahead (2026-2030)
2026-2027: Regulators shift from "Do you have an incident response plan?" to "Show us your metrics."
Fed wants to see: Average detection time, average resolution time, regulatory notification accuracy
Your runbook becomes compliance evidence. You'll present incident metrics in exams.
2027-2028: AI incidents become more common, not less.
50% more models in production by 2028 (industry growth)
More models = more drift, more fairness issues, more incidents
Banks with the best incident response will have a competitive advantage (regulators extend trust to their new models faster)
2028-2030: Automation in incident response increases.
Auto-pause triggers when Red alert fires (no human decision needed for autonomy level 4)
Auto-rollback to previous model version
Automated fairness remediation (retrain with constraints)
But human approval will still be required for all critical decisions. Regulators won't accept "the system paused itself" as an answer.
HIVE Summary
Key takeaways:
AI incidents require different escalation than infrastructure incidents. A collapse in model confidence isn't the same as a database outage: it's a business failure, not a system failure.
Pre-built escalation matrix removes judgment calls. Alert fires → Classification happens → Action is predetermined. No 30-minute emergency calls trying to figure out what to do.
Three axes determine escalation: Autonomy level (how much does the model decide?), Impact scope (how many decisions?), and Criticality (how important?). Combination determines timeline and who gets involved.
Detection must be automated with clear thresholds (Green/Yellow/Red). Humans can't monitor 20 models simultaneously. Alerts must be tuned to signal vs. noise.
Incident Commander role is critical—someone owns the first 15 minutes and makes the initial decision (halt/pause/investigate). Everything else flows from there.
Start here:
If you have production AI models but no incident response: Audit your models first. Document autonomy level, scope, criticality for each. That's the foundation.
If you have alerts but no escalation rules: Build your escalation matrix this month. Laminate it. Put it in your war room. Share with on-call team.
If you've had an AI incident: Post-mortem isn't about blame. It's about building a runbook so the same incident doesn't happen twice. Do that immediately.
Looking ahead (2026-2030):
Fed increasingly expects incident response metrics. Detection time, resolution time, notification accuracy. These become compliance KPIs.
Auto-pause and auto-remediation will emerge, but human judgment remains for critical decisions.
Banks with mature incident response will deploy AI faster (regulators trust them more).
Open questions:
How do you define "material issue" requiring regulatory notification? (Fed says 24 hours, but what triggers that clock?)
When a model is paused, should it fall back to a previous version or manual decision-making? (Different banks choose differently.)
How do you prevent alert fatigue when 20+ models all have yellow alerts? (Aggregation and prioritization are active research problems.)
Jargon Buster
Escalation Matrix: A table mapping alert severity + incident characteristics to specific actions and stakeholders. Why it matters in BFSI: Regulators expect clear decision rules, not judgment calls. A documented matrix shows auditors that you think systematically about AI incidents.
Drift Detection: Automated monitoring that flags when model inputs (features) or outputs (predictions) change significantly from training baseline. Why it matters in BFSI: Fed guidance requires "continuous monitoring" of models. Drift detection is how you do continuous monitoring without 24/7 manual reviews.
Autonomy Level: Classification of how much a model decides independently vs. requiring human review. Level 1 = human always reviews. Level 4 = model decides autonomously. Why it matters in BFSI: Level 4 incidents need faster response (decisions executing immediately). Level 1 incidents can wait until next business day (humans review anyway).
Root Cause Analysis: Investigation answering "why did this happen?" not just "what happened?" Example: Model confidence dropped (what) because feature X went null due to upstream schema change (why). Why it matters in BFSI: Regulators ask "did you understand what broke?" Not understanding root cause means you can't prevent it again.
False Positive Alert: Alert that triggers without a real incident. Example: Model confidence drops 8% because applicant pool shifted (normal), not because model broke. Why it matters in BFSI: Too many false positives = teams ignore alerts = you miss real incidents. Alert thresholds must be carefully tuned.
War Room: Temporary team assembled for serious incidents—Incident Commander, ML Lead, Risk Manager, sometimes Legal. Coordinating in Slack channel or Zoom call. Why it matters in BFSI: Serious AI incidents need real-time coordination. Communication delays = longer resolution = bigger business impact.
Regulatory Notification: Telling your regulator (Fed, OCC, EBA, FCA) about a material incident. Why it matters in BFSI: Fed expects 24-hour notification for material issues. Failing to notify = additional violation on top of the original incident. Being slow = "we didn't take this seriously" signal.
Runbook: Written, step-by-step procedure showing what to do when incident happens. Not a guide. Specific actions: "When this alert fires, do X, Y, Z in that order." Why it matters in BFSI: At 3 AM, people don't think clearly. A runbook removes thinking—follow the steps. Regulators expect documented runbooks, not ad-hoc responses.
Fun Facts
On Alert Fatigue: A major US bank set up 47 separate model monitoring alerts across credit, fraud, and compliance systems. Within 3 months, 62% of alerts were ignored because they fired constantly on non-critical changes. They discovered that 78% of alerts could be consolidated into 5 meaningful thresholds. The lesson: More alerts ≠ more insight. Fewer, better-tuned alerts catch more real incidents because teams actually pay attention.
On Fairness Incident Response: One bank discovered their loan denial model had a 9% fairness disparity (women getting denied more than men) during a quarterly audit—but their monitoring system had been running for 6 months without flagging it. Root cause: They measured "monitoring dashboard accuracy" (is the dashboard showing correct numbers?), not "model fairness" (is the model making fair decisions?). They added direct model behavior checks after that. Lesson: Monitor the monitor. Your monitoring system can have bugs too.
For Further Reading
NIST Incident Response Framework (NIST, 2024) | https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf | Government playbook for incident response. Mostly IT-focused, but escalation matrix approach applies to AI incidents.
Fed Guidance on Model Risk Management (Federal Reserve, 2025) | https://www.federalreserve.gov/publications/files/bcreg20250124a.pdf | Specific requirements for model monitoring, documentation, and regulatory notification. Core compliance reading.
EBA Guidelines on AI Governance (European Banking Authority, 2026) | https://www.eba.europa.eu/regulation-and-policy/artificial-intelligence/guidelines-artificial-intelligence-governance | European expectations for AI incident handling and monitoring requirements.
Building Observable ML Systems (Google Research + Netflix, 2025) | https://research.google/blog/detecting-drift-in-ml-systems/ | Technical deep-dive on drift detection, monitoring architecture, and production patterns at scale.
Post-Mortem Culture and Blameless Incident Analysis (Google SRE Book, adapted for ML 2024) | https://sre.google/books/ | How to run effective post-mortems that drive improvement, not blame. Critical for building psychological safety in incident response.
Next up: Week 17 Sunday dives into "How Risk Committees Interpret AI Outputs"—because your beautiful incident response matrix means nothing if the risk committee doesn't trust it or understand what they're looking at.
This is part of our ongoing work understanding AI deployment in financial systems. If you're building runbooks or rebuilding incident response for AI models, share your experience—what worked, what didn't, what regulatory feedback you got?
