Monitoring Glossary
All monitoring definitions and acronyms explained simply.
The monitoring world is full of acronyms and technical terms that can seem obscure at first. SLA, SLO, SLI, MTTR, MTTD, MTBF... Yet these terms are essential for communicating effectively with your team and partners.
This glossary gathers clear and precise definitions of all the terms you'll encounter in your monitoring journey. Each definition is accompanied by concrete examples to facilitate understanding.
Use this glossary as a quick reference or to onboard new team members. Bookmark this page; you'll come back to it often.
Availability and SLA
- Uptime
- Percentage of time a service is operational and accessible. 99.9% uptime means the service can be unavailable for at most 8.77 hours per year.
- Downtime
- Period during which a service is unavailable or non-functional. Downtime can be planned (maintenance) or unplanned (incident).
- Availability
- Ability of a service to be operational when needed. Often expressed as a percentage (e.g., 99.95% availability).
- SLA (Service Level Agreement)
- Contractual agreement between a provider and customer defining guaranteed service levels. Usually includes availability commitments and penalties for non-compliance.
- SLO (Service Level Objective)
- Internal service level target that the team commits to achieving. Usually stricter than the SLA, to keep a safety margin.
- SLI (Service Level Indicator)
- Quantitative metric measuring an aspect of the service. E.g., successful request rate, p99 latency, page load time.
- MTTR (Mean Time To Recovery)
- Average time to restore a service after an incident. Measures team response speed. Typical target: < 1 hour.
- MTTD (Mean Time To Detect)
- Average time between an incident starting and being detected. Good monitoring reduces MTTD to a few minutes.
- MTBF (Mean Time Between Failures)
- Average time between two failures. Measures system reliability. A higher MTBF means a more stable system.
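The availability percentages above translate mechanically into a downtime budget. A minimal sketch (the function name and the 365.25-day year are illustrative choices, not from any particular tool):

```python
def max_downtime(availability_pct: float, period_hours: float = 365.25 * 24) -> float:
    """Maximum allowed downtime, in hours, for a given availability target.

    Illustrative helper: defaults to a full year (365.25 days).
    """
    return period_hours * (1 - availability_pct / 100)

yearly_budget = max_downtime(99.9)   # about 8.77 hours per year
```

The same function gives the monthly budget by passing, e.g., `period_hours=30 * 24`.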
Performance
- Latency
- Delay between a request and its response. Measured in milliseconds (ms). High latency degrades user experience.
- TTFB (Time To First Byte)
- Time between sending a request and receiving the first byte of the response. A key indicator of server performance.
- Response Time
- Total time to receive a complete response. Includes network latency and processing time.
- Throughput
- Number of requests processed per unit of time. E.g., 1000 requests per second (RPS).
- Percentile (p50, p95, p99)
- Statistical measure indicating the value below which X% of observations fall. E.g., p99: 99% of requests are faster than this value.
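The difference between an average and a percentile is easy to see on synthetic data. A small sketch using Python's standard library (the latency values are made up for illustration):

```python
import statistics

# 100 synthetic request latencies: 99 fast ones plus one very slow outlier
latencies_ms = [100] * 99 + [10_000]

mean = statistics.mean(latencies_ms)                 # 199.0 ms, inflated by the outlier
p50 = statistics.quantiles(latencies_ms, n=100)[49]  # 100.0 ms, the typical request
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # pulled toward the outlier
```

The mean suggests requests take twice as long as most users actually experience, while p50 reflects the typical case and p99 exposes the slow tail.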
Alerts and Incidents
- Alert Fatigue
- Team desensitization to too many or irrelevant alerts. Leads to ignoring real alerts.
- False Positive
- Alert triggered when there's no real problem. Main cause of alert fatigue.
- Escalation
- Process of passing an incident to a higher support level or management when not resolved quickly.
- On-Call
- Period during which a team member must be available to respond to incidents, often outside business hours.
- Incident
- Unplanned event causing service interruption or degradation. Classified by severity (P1 to P4).
- Post-Mortem
- Retrospective analysis of an incident to understand root causes and define preventive actions. Blameless, focused on improvement.
Check Types
- Health Check
- Dedicated endpoint returning a service's health status. Usually /health or /status. Verifies critical dependencies.
- Heartbeat
- Periodic signal sent by an application to confirm it's working. A missing heartbeat triggers an alert.
- Synthetic Monitoring
- Automated tests that simulate user journeys, detecting problems before real users encounter them.
- RUM (Real User Monitoring)
- Performance measurement as experienced by real users. Collects data from browsers.
- Probe
- Remote monitoring point that checks service availability. Multiple geographically distributed probes improve reliability.
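The heartbeat logic described above boils down to "alert when the last signal is older than the expected interval plus some grace period". A minimal sketch (the function name, signature, and 30-second grace default are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def heartbeat_missed(last_seen: datetime, interval: timedelta,
                     grace: timedelta = timedelta(seconds=30)) -> bool:
    """True when a heartbeat is overdue beyond its expected interval plus grace.

    Illustrative helper, not from any specific monitoring product.
    """
    return datetime.now(timezone.utc) - last_seen > interval + grace

# A service expected to ping every 60 s, last seen 5 minutes ago, is overdue:
stale = heartbeat_missed(
    datetime.now(timezone.utc) - timedelta(minutes=5),
    timedelta(seconds=60),
)
```

The grace period avoids false positives from normal network jitter, which is exactly the alert-fatigue trap described earlier.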
Frequently Asked Questions
What's the difference between SLA, SLO and SLI?
SLI is the metric measured, SLO is the internal target, SLA is the contractual commitment with penalties. E.g., SLI = 99.95% measured uptime, SLO = 99.9% target, SLA = 99.5% guaranteed to customer.
How do you calculate MTTR?
MTTR = Sum of resolution times / Number of incidents. E.g., 3 incidents resolved in 30min, 45min, 15min -> MTTR = (30+45+15)/3 = 30 minutes.
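The formula above is a plain arithmetic mean; as a one-function sketch (the function name is illustrative):

```python
def mttr(resolution_minutes: list[float]) -> float:
    """Mean Time To Recovery: average of per-incident resolution times."""
    return sum(resolution_minutes) / len(resolution_minutes)

average = mttr([30, 45, 15])  # 30.0 minutes, as in the worked example
```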
What's a good uptime rate?
It depends on context. 99% (3.65 days of downtime per year) is the minimum acceptable. 99.9% (8.77 h/year) is standard. 99.99% (52 min/year) is excellent.
Why use percentiles rather than averages?
Averages hide extreme values. p99 shows that 99% of users have a better experience than this value, revealing the problematic slow tail.
Conclusion
Mastering monitoring vocabulary is essential for communicating effectively with your team and stakeholders. These terms will constantly come up in your discussions about reliability and performance.
Keep this glossary handy and share it with new team members. Common terminology is the foundation of effective collaboration.
Ready to Sleep Soundly?
Start free, no credit card required.