Cron Failure Alerts
Get alerted immediately when a cron fails, so you can react before the business is affected
A failing cron represents a direct business risk. Unlike a silent cron that stops executing, a failing cron runs but encounters a problem: unhandled exception, timeout, unavailable resource. Without proper alerting, these failures go unnoticed for hours, even days.
The consequences of an undetected failure vary with the cron's criticality: incomplete backups, unsynchronized customer data, undelivered reports, unsent invoices. Every minute of detection delay increases the potential impact and complicates resolution.
MoniTao offers two complementary mechanisms to detect failures: the missing ping (the cron doesn't signal its success) and the /fail endpoint (the cron explicitly signals its failure). Together, they cover both silent failures and failures the script can detect itself.
Types of Cron Failures
Cron failures manifest in different ways, each requiring an adapted detection approach:
- Failure with return code: The script terminates with exit 1 or another non-zero code. This is the easiest case to detect because it is explicit. Use && to make the ping conditional on success.
- Uncaught exception: An error in the code (PHP, Python, etc.) interrupts execution. The script doesn't finish properly, so no success ping.
- Timeout: The cron exceeds its normal execution time and is interrupted by the system. The end ping is never sent.
- Partial failure: The script terminates successfully (exit 0) but part of the business logic failed. Harder to detect without explicit validation.
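The timeout case can be turned into an explicit failure instead of a missing ping. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed (it exits with 124 when the limit is reached). The function name `run_with_limit` and the variable `LAST_STATUS` are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a time limit and classify the
# outcome. Assumes GNU coreutils `timeout`, which exits with 124 when the
# limit is reached.
run_with_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local code=$?
    case "$code" in
        0)   LAST_STATUS="success" ;;
        124) LAST_STATUS="timeout" ;;
        *)   LAST_STATUS="failed" ;;
    esac
    # Hand the original exit code back to the caller
    return "$code"
}
```

A cron entry could call `run_with_limit 300 /path/to/script.sh` and then choose between the success ping and /fail based on `LAST_STATUS`, so a hung job is reported as soon as the limit expires.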
Common Failure Causes
Understanding the common causes helps you anticipate and diagnose problems:
- Exhausted resources: Full disk, insufficient memory, saturated database connections. The cron starts but cannot complete its work.
- External dependencies: Third-party API down, cloud service unavailable, blocking firewall. The cron is functional but its dependencies are not.
- Invalid data: Corrupted input file, unexpected format, missing data. The cron receives data it cannot process.
- Concurrency problems: Database locks, conflicts with other processes, deadlocks. The cron waits indefinitely for a resource.
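Some of these causes can be checked before the real work starts. As one illustration of the "exhausted resources" case, here is a small sketch of a disk-space check. The function name `has_free_space` and the thresholds are assumptions, not part of MoniTao.

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeed only if the filesystem holding
# the given path has at least min_kb kilobytes available.
has_free_space() {
    local path="$1" min_kb="$2" avail_kb
    # df -P prints one POSIX-format line per filesystem; column 4 is "Available".
    avail_kb=$(df -Pk "$path" 2>/dev/null | awk 'NR == 2 { print $4 }')
    # An empty value means df failed (for example, the path does not exist).
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}
```

A backup cron could run `has_free_space /var/backups 1048576` first and signal /fail with a clear message when less than 1 GB is free, rather than failing halfway through the job.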
Failure Detection Strategies
MoniTao offers several approaches to detect cron failures:
- Conditional ping (&&): The simplest approach: ./script.sh && curl URL. The ping is sent only if the script returns 0, so any failure means no ping and therefore an alert.
- Explicit /fail endpoint: Call /fail instead of /ping when you detect a problem in your script. The alert is immediate, without waiting for the timeout.
- Wrapper script: A wrapper that captures errors, logs them, and signals the appropriate status to MoniTao. Reusable for all your crons.
- Business validation: Beyond the return code, validate that the business logic worked (record count, files created) before pinging success.
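Business validation can be sketched as follows. This is an illustration under assumptions: the URLs are placeholders built on the /ping/ and /fail/ pattern used on this page, and `validate_export` and `report_job` are hypothetical names. The check itself (a minimum line count in an output file) stands in for whatever rule fits your job.

```shell
#!/bin/bash
# Placeholder URLs (assumptions); substitute your own heartbeat token.
PING_URL="https://api.monitao.com/ping/token"
FAIL_URL="https://api.monitao.com/fail/token"

# Hypothetical check: the export must exist and contain at least min_rows lines.
validate_export() {
    local file="$1" min_rows="$2" rows
    [ -f "$file" ] || return 1
    rows=$(wc -l < "$file")
    [ "$rows" -ge "$min_rows" ]
}

# Ping success only when the validation passes; otherwise signal /fail.
report_job() {
    local file="$1" min_rows="$2"
    if validate_export "$file" "$min_rows"; then
        curl -fsS --max-time 10 "$PING_URL"
    else
        curl -fsS --max-time 10 "$FAIL_URL"
    fi
}
```

Calling `report_job /data/export.csv 100` after the job turns an "exit 0 but empty output" run into an explicit failure.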
Complete Wrapper Script
Here's a shell wrapper that logs each run and reports success or failure to MoniTao:
#!/bin/bash
# wrapper.sh - Universal wrapper for crons with MoniTao
# Usage: ./wrapper.sh "command" "https://api.monitao.com/ping/token"
COMMAND=$1
MONITAO_URL=$2
LOG_DIR="/var/log/cron"
LOG_FILE="$LOG_DIR/$(date +%Y%m%d_%H%M%S).log"
# Refuse to run without both arguments
if [ -z "$COMMAND" ] || [ -z "$MONITAO_URL" ]; then
    echo "Usage: $0 \"command\" \"ping-url\"" >&2
    exit 2
fi
# Make sure the log directory exists before writing to it
mkdir -p "$LOG_DIR"
# Execute command and capture output
echo "[$(date)] Starting: $COMMAND" >> "$LOG_FILE"
START=$(date +%s)
# bash -c keeps the quoting and arguments of the command string intact
bash -c "$COMMAND" >> "$LOG_FILE" 2>&1
EXIT_CODE=$?
DURATION=$(( $(date +%s) - START ))
echo "[$(date)] Finished with code: $EXIT_CODE after ${DURATION}s" >> "$LOG_FILE"
# Signal MoniTao based on result
if [ "$EXIT_CODE" -eq 0 ]; then
    curl -fsS --max-time 10 "$MONITAO_URL" \
        -d "{\"status\": \"success\", \"duration\": \"${DURATION}s\"}" \
        -H "Content-Type: application/json"
else
    # Replace /ping/ with /fail/ in URL
    FAIL_URL=$(echo "$MONITAO_URL" | sed 's|/ping/|/fail/|')
    curl -fsS --max-time 10 "$FAIL_URL" \
        -d "{\"status\": \"failed\", \"exit_code\": $EXIT_CODE}" \
        -H "Content-Type: application/json"
fi
# Propagate the command's exit code to cron
exit $EXIT_CODE
This wrapper captures the cron's output in a log file, checks the return code, and signals the matching status to MoniTao. On failure, it calls the /fail endpoint for an immediate alert. Use it in your crontab like this: 0 * * * * /path/to/wrapper.sh "/path/to/script.sh" "URL"
Alert Configuration
Optimize your alerts for effective failure response:
- Immediate alert on /fail: The /fail endpoint triggers an instant alert, without waiting for timeout. Ideal for critical failures requiring quick intervention.
- Delay on missing ping: If the cron doesn't ping, the alert fires after the interval plus the grace period. Configure a short grace period for critical crons.
- Alert escalation: Configure escalation levels: immediate email, then Slack after 15 minutes without response, then SMS after 1 hour.
- Alert context: Include useful information in the /fail payload (error code, message, last log lines) to facilitate diagnosis.
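Building the /fail payload by hand is error-prone because log lines often contain quotes and backslashes. The sketch below shows one way to assemble it safely in plain shell. The "status", "exit_code", and "message" fields appear elsewhere on this page; the "log_tail" field and the helper names `json_escape` and `build_fail_payload` are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: escape text read from stdin for use inside a JSON
# string. Tabs become spaces, other control characters are dropped,
# backslashes and double quotes are escaped, and lines are joined with \n.
json_escape() {
    tr '\t' ' ' \
        | tr -d '\000-\010\013-\037' \
        | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
        | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# Build the payload from an exit code, a message, and the last log lines.
# "log_tail" is an assumed field name, not a documented one.
build_fail_payload() {
    local exit_code="$1" message="$2" log_file="$3"
    local msg tail_lines
    msg=$(printf '%s' "$message" | json_escape)
    tail_lines=$(tail -n 5 "$log_file" 2>/dev/null | json_escape)
    printf '{"status": "failed", "exit_code": %d, "message": "%s", "log_tail": "%s"}' \
        "$exit_code" "$msg" "$tail_lines"
}
```

The result can be passed to curl with -d "$(build_fail_payload "$EXIT_CODE" "Sync failed" "$LOG_FILE")" together with the Content-Type: application/json header.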
Failure Handling Best Practices
Adopt these practices for robust cron failure management:
- set -e in bash: Add set -e at the start of the script to stop execution at the first error. This keeps later steps from running on top of a failed one and leaving inconsistent state.
- Application try/catch: In PHP, Python, and similar scripts, wrap the logic in a global try/catch. Catch the exception, log it, and signal /fail.
- Systematic logging: Log the start, end, and key steps of each cron. On failure, logs help quickly locate the problem.
- Non-regression tests: Regularly test your alerts by forcing a failure. Verify that the alert arrives within the expected time and contains useful information.
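In shell, the closest counterpart to a global try/catch is an EXIT trap, which runs however the job ends. A minimal sketch, assuming placeholder URLs that follow the /ping/ and /fail/ pattern used on this page; `report_exit` and `run_job` are hypothetical names.

```shell
#!/bin/bash
# Placeholder URLs (assumptions); substitute your own heartbeat token.
PING_URL="https://api.monitao.com/ping/token"
FAIL_URL="https://api.monitao.com/fail/token"

# Called from the EXIT trap: report the final status exactly once.
report_exit() {
    local code=$?
    if [ "$code" -eq 0 ]; then
        curl -fsS --max-time 10 "$PING_URL"
    else
        curl -fsS --max-time 10 "$FAIL_URL" \
            -d "{\"status\": \"failed\", \"exit_code\": $code}" \
            -H "Content-Type: application/json"
    fi
}

# Run a command in a subshell so the trap stays local to this job.
# The subshell's exit status is the command's own exit status.
run_job() (
    trap report_exit EXIT
    "$@"
)
```

For example, `run_job /path/to/script.sh` reports success or failure whichever way the script ends, and the script itself can use set -e internally.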
Failure Alerting Checklist
- Scripts with set -e or equivalent
- Global try/catch in application scripts
- Conditional ping (&&) configured
- /fail endpoint used for explicit failures
- Cron logs configured and accessible
- Alerts tested with simulated failure
Frequently Asked Questions
What's the difference between missing ping and the /fail endpoint?
A missing ping means the cron didn't execute at all (silent) or didn't finish (crash, timeout). The /fail endpoint means the cron executed, detected a problem, and is informing you explicitly. /fail triggers an immediate alert, whereas a missing ping is only reported once the timeout has elapsed.
Can I have different alert priorities based on criticality?
Yes, create separate heartbeats for critical crons (immediate email + Slack alert) and less critical crons (email only after delay). Configure notification channels according to criticality level.
How do I automatically restart a cron after failure?
MoniTao detects failures but doesn't restart crons. For automatic restart, use systemd with Restart=on-failure, supervisord, or implement retry logic in your script. Monitor the retry count to avoid infinite loops.
Exactly how long before the alert arrives?
For /fail: immediately (a few seconds). For a missing ping: the configured interval plus the grace period. For example, an hourly cron with a 10-minute grace period triggers an alert 70 minutes after the last successful ping.
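The missing-ping delay is simple addition, which can be handy when reviewing many crons at once. A tiny sketch, where `alert_delay_minutes` is a hypothetical helper name:

```shell
#!/bin/bash
# Minutes between the last successful ping and the missing-ping alert:
# configured interval plus grace period.
alert_delay_minutes() {
    local interval_min="$1" grace_min="$2"
    echo $(( interval_min + grace_min ))
}

# Hourly cron with a 10-minute grace period
alert_delay_minutes 60 10   # prints 70
```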
How do I include the error message in the MoniTao alert?
The /fail endpoint accepts a JSON payload. Include a "message" field with the error: curl URL -d '{"status": "failed", "message": "Database connection refused"}'. This message will appear in the notification.
My cron fails intermittently. How do I reduce false alerts?
Implement retry logic in your script (3 attempts with exponential delay) before signaling failure. You can also configure MoniTao to only alert after X consecutive failures, filtering out transient issues.
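The retry approach can be sketched in a few lines of shell. This is an illustration, not a MoniTao feature: `retry` is a hypothetical helper, and the attempt count and base delay are parameters you would tune.

```shell
#!/bin/bash
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry() {
    local max="$1" delay="$2" attempt=1 code=0
    shift 2
    while :; do
        if "$@"; then
            return 0
        else
            code=$?
        fi
        # Give up after the last allowed attempt and return its exit code
        if [ "$attempt" -ge "$max" ]; then
            return "$code"
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}
```

With `retry 3 2 /path/to/script.sh`, the script is attempted up to three times with waits of 2 and 4 seconds in between, and failure is signaled only if all three attempts fail.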
React Quickly to Failures
Cron failures are inevitable: external dependencies, exhausted resources, invalid data. What makes the difference is the speed of detection and reaction. A failure detected in 5 minutes can often be fixed before any user impact. A failure detected after 24 hours can leave lasting damage.
Combine conditional ping (&&) for implicit failures with the /fail endpoint for explicit failures. Wrap your crons in a wrapper that logs, detects, and signals. With MoniTao, turn every failure into an opportunity for quick improvement rather than a customer incident.