Alerts by Criticality Level
Intelligently prioritize your alerts to react quickly to important incidents.
Not all alerts are equal. A production site down at 2am deserves an immediate reaction. An SSL certificate expiring in 30 days can wait until the next business day. Treating all alerts the same way leads to two problems: either you drown in noise, or you miss critical incidents.
Classification by criticality routes each alert to the right channel and the right person. Critical incidents trigger full escalation (SMS, call, manager notification). Informative alerts arrive quietly via email for planned action.
This guide helps you define your criticality levels, configure appropriate routing, and avoid classic pitfalls (everything critical = nothing critical).
The 4 Criticality Levels
Here is a standard classification you can adapt to your context; a code sketch follows the list:
- Critical (P1): Major immediate business impact. Production site down, payments impossible, corrupted data. Immediate intervention required, even at 3am.
- High (P2): Significant service degradation. Severely degraded performance, a major feature impaired, frequent errors. Quick intervention required during business hours.
- Medium (P3): Issue to monitor. Minor degradation, a warning that could worsen, a threshold being approached. Planned action within 24-48h.
- Low (P4): Information or preventive maintenance. Certificate expires in 30 days, slightly delayed backup, unusual metric. Action in the next sprint.
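If you manage monitors as code, these four levels map naturally onto an enumeration. A minimal Python sketch; the names and comments are illustrative, not tied to any particular tool:

```python
from enum import IntEnum

class Criticality(IntEnum):
    """The four standard levels; a lower value means more urgent."""
    P1_CRITICAL = 1  # major immediate business impact: wake someone up
    P2_HIGH = 2      # significant degradation: quick fix in business hours
    P3_MEDIUM = 3    # monitor and plan action within 24-48h
    P4_LOW = 4       # informational: batch into the next sprint

# Ordering comes for free with IntEnum: P1 sorts before P4.
assert Criticality.P1_CRITICAL < Criticality.P4_LOW
```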
Why Classify Alerts?
Classification by criticality brings several benefits:
- Focus on essentials: When everything is urgent, nothing is. Classification immediately identifies what requires immediate action.
- Noise reduction: Low-priority alerts don't interrupt work. They're processed in batches at dedicated times.
- Better sleep: Only real emergencies wake up the on-call team. Less fatigue = better decisions.
- Relevant metrics: Measure MTTR (mean time to resolution) by criticality level. A high MTTR on P1 is a problem; a high MTTR on P4 is normal. See the sketch after this list.
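To make the metrics point concrete, here is a minimal sketch that computes MTTR per level from a list of resolved incidents. The incident tuples are made-up sample data:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical resolved incidents: (level, opened_at, resolved_at).
incidents = [
    ("P1", datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 2, 40)),
    ("P1", datetime(2024, 5, 3, 9, 0), datetime(2024, 5, 3, 9, 20)),
    ("P4", datetime(2024, 5, 2, 8, 0), datetime(2024, 5, 6, 8, 0)),
]

# Group resolution durations by criticality level.
durations = defaultdict(list)
for level, opened, resolved in incidents:
    durations[level].append(resolved - opened)

for level, spans in sorted(durations.items()):
    mttr = sum(spans, timedelta()) / len(spans)
    print(f"{level}: MTTR = {mttr}")
# P1: MTTR = 0:30:00          -- worth investigating if it grows
# P4: MTTR = 4 days, 0:00:00  -- perfectly normal for a digest-level alert
```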
How to Define Levels
Follow this methodology to classify your monitors:
- List your services: Enumerate all monitored services: website, API, database, third-party services, scheduled tasks.
- Evaluate business impact: For each service, ask: how much does an hour of unavailability cost? How many users are impacted? What is the effect on brand image?
- Define thresholds: Main site = P1 if down. Secondary API = P2. Internal service = P3. Staging environment = P4.
- Document: Create a classification matrix accessible to the whole team (see the sketch after this list) and review it regularly.
- Iterate: After a few weeks, analyze: too many P1s? Maybe some should be P2. Never P1? Maybe you're missing critical monitoring.
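The classification matrix doesn't have to live only in a wiki; keeping it as data under version control makes reviews part of normal code review. A minimal sketch, with hypothetical service names:

```python
# Version-controlled classification matrix: service -> level when down.
CLASSIFICATION_MATRIX = {
    "www.example.com":    "P1",  # main site: direct revenue impact
    "partners-api":       "P2",  # secondary API: significant degradation
    "internal-dashboard": "P3",  # internal service: team inconvenience
    "staging":            "P4",  # no customer-facing impact
}

def criticality_for(service: str) -> str:
    # Default to P2 so unclassified services get noticed, not silenced.
    return CLASSIFICATION_MATRIX.get(service, "P2")
```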
Routing by Criticality
Each criticality level deserves different treatment; a routing sketch follows the table:
| Level | Channel | Expected response time |
|---|---|---|
| Critical (P1) | SMS + Call + Email | < 15 minutes |
| High (P2) | SMS + Email | < 1 hour (business hours) |
| Medium (P3) | Email + Slack | < 24 hours |
| Low (P4) | Email digest | < 1 week |
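In code, the table above reduces to a mapping from level to notification channels plus a response-time target. A sketch of such a routing policy; the channel names and targets mirror the table, and none of this is any specific tool's syntax:

```python
from datetime import timedelta

# Routing policy: which channels fire, and how fast someone must respond.
ROUTING = {
    "P1": {"channels": ["sms", "call", "email"], "respond_within": timedelta(minutes=15)},
    "P2": {"channels": ["sms", "email"],         "respond_within": timedelta(hours=1)},
    "P3": {"channels": ["email", "slack"],       "respond_within": timedelta(hours=24)},
    "P4": {"channels": ["email_digest"],         "respond_within": timedelta(weeks=1)},
}

def notify(level: str, message: str) -> None:
    policy = ROUTING[level]
    for channel in policy["channels"]:
        # Placeholder: replace print with your real channel integrations.
        print(f"[{channel}] {message} (respond within {policy['respond_within']})")

notify("P1", "Production site down")
```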
Classification Examples
Here's how to classify common alerts:
- Production site down (5xx): P1 - Critical. Every minute counts. Immediate SMS + escalation if no acknowledgment within 5 minutes (see the escalation sketch after this list).
- API latency > 2 seconds: P2 - High. Significant user impact, but the service still works. Email + SMS during business hours.
- SSL certificate expires in 14 days: P3 - Medium. Enough time to act. Email for planned action.
- Disk usage > 70%: P4 - Low. Preventive warning. Weekly email digest with all P4 alerts.
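The P1 example above mentions escalation after 5 minutes without acknowledgment. Here is a minimal polling sketch of that loop; `is_acknowledged` and `page` are placeholders you would wire to your alerting provider:

```python
import time

ESCALATION_CHAIN = ["on-call engineer", "backup engineer", "engineering manager"]
ACK_TIMEOUT_SECONDS = 5 * 60  # escalate after 5 minutes without an ack

def is_acknowledged(incident_id: str) -> bool:
    """Placeholder: query your alerting tool for an acknowledgment."""
    return False  # assume unacknowledged for the demo

def page(person: str, incident_id: str) -> None:
    """Placeholder: send SMS/call through your provider."""
    print(f"Paging {person} about {incident_id}")

def escalate(incident_id: str) -> None:
    # Walk the chain, giving each person the ack timeout before moving on.
    for person in ESCALATION_CHAIN:
        page(person, incident_id)
        deadline = time.monotonic() + ACK_TIMEOUT_SECONDS
        while time.monotonic() < deadline:
            if is_acknowledged(incident_id):
                return
            time.sleep(10)  # poll every 10 seconds
    print(f"Escalation chain exhausted for {incident_id}")
```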
Best Practices
Avoid classic classification pitfalls:
- Avoid "everything critical": If more than 10% of your alerts are P1, you have a classification problem. Re-evaluate using actual business impact as the criterion (a sketch for checking this follows the list).
- Review regularly: A new service may deserve P1; a legacy service being phased out can drop to P3. Review the matrix quarterly.
- Train the team: All members should understand the classification and know how to react to each level.
- Analyze incidents: A P1 incident that had no business impact may have been misclassified. Use post-mortems to refine.
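The 10% rule of thumb from the first point is easy to check automatically. A sketch that flags classification drift from a month of alert history; the data and threshold are illustrative:

```python
from collections import Counter

# Hypothetical export of last month's alerts: one level per alert.
recent_alerts = ["P1", "P2", "P2", "P3", "P1", "P4", "P3", "P3", "P1", "P4"]

counts = Counter(recent_alerts)
p1_ratio = counts["P1"] / len(recent_alerts)

if p1_ratio > 0.10:
    print(f"{p1_ratio:.0%} of alerts are P1 -- review the matrix, "
          "some of these probably belong in P2.")
```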
Classification Checklist
- Define the 4 levels with concrete examples
- Classify all existing monitors
- Configure routing by level
- Document the classification matrix
- Train the team on expected response times
Frequently Asked Questions
How many criticality levels do you need?
Four levels (P1-P4) are generally sufficient. More levels complicate classification without adding value; fewer levels don't differentiate enough.
Can a monitor change criticality?
Yes, and it's normal. High latency can be P3 during the day (team present) and P2 at night (less tolerance). Use time-based rules, as in the sketch below.
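A time-based rule can be as simple as checking the clock before assigning the level. A sketch of the latency example above; the business-hours boundaries are assumptions:

```python
from datetime import datetime

def latency_criticality(now: datetime) -> str:
    """High latency: P3 while the team is at their desks, P2 otherwise."""
    business_hours = 9 <= now.hour < 18 and now.weekday() < 5
    return "P3" if business_hours else "P2"

print(latency_criticality(datetime(2024, 5, 6, 14, 0)))  # Monday 2pm -> P3
print(latency_criticality(datetime(2024, 5, 6, 23, 0)))  # Monday 11pm -> P2
```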
Who decides criticality?
Ideally, the product owner and the technical team together. The PO knows the business impact; the technical team knows the dependencies.
How to handle disagreements on criticality?
When in doubt, start with the higher level. It's easier to lower criticality after analysis than to raise it after a missed incident.
Are staging environments ever critical?
Rarely P1, but staging down before a major deployment may deserve P2. Context matters.
How does MoniTao handle criticalities?
Each MoniTao monitor has a configurable criticality level. This level determines notification channels and escalation delays.
Conclusion
Classification by criticality is fundamental to effective alert management. It lets you react quickly to real problems while avoiding alert fatigue from minor incidents.
Take time to define your classification matrix, configure appropriate routing in MoniTao, and review regularly. Your team will thank you.
Ready to Sleep Soundly?
Start free, no credit card required.