Batch Job Monitoring
Monitor your batch processing and heavy jobs with precision
Batch jobs are the hidden workhorses of IT infrastructure. Whether importing thousands of orders, synchronizing databases between multiple systems, generating nightly reports, or processing accumulated files, these processes are often scheduled to run outside business hours when no one is watching them.
This invisibility poses a major problem: when a batch job fails, the consequences don't appear immediately. It may take hours or even days to notice that the previous night's import never ran. By then, the damage is done: outdated data, inventory errors, missing invoices, desynchronized systems.
Heartbeat monitoring is well suited to batch jobs. Instead of monitoring the server or service, you monitor the job's own vital signs: did it start? Is it progressing? Did it finish successfully? If an expected signal does not arrive in time, an alert is triggered.
The Specific Challenges of Batch Jobs
Unlike interactive applications that fail visibly when there's a problem, batch jobs present unique challenges:
- Variable execution time: The same batch can take 10 minutes or 2 hours depending on data volume. A timeout that's too tight triggers false alerts; one that's too loose delays the detection of real failures.
- Silent dependencies: A batch often depends on database availability, external APIs, shared file systems. The failure of a dependency is rarely explicit.
- Partial errors: A batch can process 9,900 items correctly and fail on 100. Should that be considered a success, failure, or something in between?
- Failures without errors: The job completes without error but produces incorrect results (empty file, desynchronized data). Monitoring that only checks exit codes or process status doesn't catch this.
What to Monitor in a Batch Job
Effective batch job monitoring covers four key aspects: startup, progress, completion, and result.
- Actual startup: Confirm that the job started at the scheduled time. A job that fails to start must be detected immediately.
- Processing progress: For long jobs, track progress (items processed out of the total). This lets you detect a job that is stuck or abnormally slow.
- Execution time: Measure the time between startup and completion. Analyze trends to detect gradual degradation.
- Result and quality: Verify not only that the job finished, but also the quality of the result: number of items processed, number of errors.
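As a rough illustration, the four aspects above can be gathered into one summary at the end of a run. The function name `summarizeRun` and the array keys are hypothetical choices for this sketch, not part of any MoniTao API.

```php
<?php
// Hypothetical helper: collects start time, progress, duration and result
// counts into one array that can be logged or sent along with a ping.
function summarizeRun(float $startedAt, float $finishedAt, int $total, int $processed, int $errors): array
{
    return [
        'started_at' => (int) $startedAt,
        'duration'   => (int) round($finishedAt - $startedAt),
        // Items attempted (OK or failed) as a percentage of the total.
        'progress'   => $total > 0 ? (int) round(($processed + $errors) / $total * 100) : 100,
        'processed'  => $processed,
        'errors'     => $errors,
    ];
}

$summary = summarizeRun(1000.0, 1600.4, 10000, 9900, 100);
echo http_build_query($summary), PHP_EOL;
```

Keeping the summary as a plain array makes it easy to reuse the same values for logging and for the query string of the final ping.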
Step-by-Step Implementation
Integrating heartbeat monitoring into a batch job follows a simple pattern: three pings at key moments, plus a timeout setting.
- Start ping (/start): Send a ping at the very beginning of your batch job. This records the start time and starts the timer.
- Progress pings (optional): For long jobs, send regular pings with the number of items processed, for example every 10,000 items or every 15 minutes.
- Completion ping (/complete or /fail): At the end of the job, send a ping with the final status (success or failure), the total duration, and the key metrics.
- Timeout configuration: Set a maximum timeout beyond which an alert is triggered. Take the maximum observed duration and add a 30-50% margin.
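The steps above can be sketched with two small helpers: one that applies the margin rule to the timeout, and one that builds the ping URL for each moment. The URL layout (base URL followed by `/start`, `/complete` or `/fail`), the `example.com` address and both function names are assumptions made for illustration; check your heartbeat's actual URL format before reusing them.

```php
<?php
// Maximum observed duration plus a safety margin, in whole seconds.
// The margin is given as an integer percentage (30 to 50 per the guideline).
function timeoutWithMargin(int $maxObservedSeconds, int $marginPercent = 30): int
{
    return (int) ceil($maxObservedSeconds * (100 + $marginPercent) / 100);
}

// Assumed URL layout: "<base>/<event>?<metrics>".
function pingUrl(string $baseUrl, string $event, array $metrics = []): string
{
    $url = rtrim($baseUrl, '/') . '/' . $event;
    return $metrics ? $url . '?' . http_build_query($metrics) : $url;
}

// A job whose slowest observed run took 2 hours, with a 50% margin:
echo timeoutWithMargin(7200, 50), PHP_EOL;
echo pingUrl('https://example.com/hb/TOKEN', 'complete', ['processed' => 9900, 'duration' => 600]), PHP_EOL;
```

Using `http_build_query` rather than string concatenation means metric values are URL-encoded automatically.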
Complete Code Example
Here is a complete implementation example for a typical batch job in PHP:
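    <?php
    // Placeholder: replace with the heartbeat URL of your own job.
    $heartbeatUrl = 'https://example.com/heartbeat/YOUR_TOKEN';
    $startTime = microtime(true);

    // Start ping: records the start time and starts the timer
    file_get_contents($heartbeatUrl . '/start');

    try {
        // fetchPendingOrders() stands in for your own data source;
        // it is assumed to return a 0-indexed array.
        $orders = fetchPendingOrders();
        $total = count($orders);
        $processed = 0;
        $errors = 0;

        foreach ($orders as $index => $order) {
            try {
                processOrder($order);
                $processed++;
            } catch (Exception $e) {
                // One bad item is logged and counted; the batch continues
                logError($order->id, $e->getMessage());
                $errors++;
            }
            // Progress ping every 1000 items
            if (($index + 1) % 1000 === 0) {
                $progress = round(($index + 1) / $total * 100);
                file_get_contents($heartbeatUrl . "?progress={$progress}");
            }
        }

        // Success or partial failure
        $duration = round(microtime(true) - $startTime);
        if ($errors === 0) {
            file_get_contents($heartbeatUrl . "?status=success&processed={$processed}&duration={$duration}");
        } else {
            file_get_contents($heartbeatUrl . "/fail?processed={$processed}&errors={$errors}&duration={$duration}");
        }
    } catch (Exception $e) {
        // Critical error
        file_get_contents($heartbeatUrl . "/fail?error=" . urlencode($e->getMessage()));
        exit(1);
    }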
This example covers the three key scenarios: normal operation with success ping, partial failures with tracking, and critical errors with immediate alert.
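One practical refinement, sketched here under assumptions: a bare `file_get_contents` call can stall the batch if the monitoring endpoint is slow. The hypothetical `sendPing` helper below adds an HTTP timeout and never throws, so a monitoring outage cannot take the job down with it. The optional `$fetch` parameter exists only so the helper can be exercised without network access.

```php
<?php
// Sends a ping without letting a slow or unreachable endpoint block or
// crash the batch. Returns true when the request succeeded.
function sendPing(string $url, int $timeoutSeconds = 5, ?callable $fetch = null): bool
{
    // Default transport: file_get_contents with an HTTP timeout.
    $fetch ??= function (string $url) use ($timeoutSeconds) {
        $context = stream_context_create(['http' => ['timeout' => $timeoutSeconds]]);
        return @file_get_contents($url, false, $context);
    };

    try {
        return $fetch($url) !== false;
    } catch (Throwable $e) {
        error_log('Heartbeat ping failed: ' . $e->getMessage());
        return false;
    }
}

// Demo with a stub transport, so no network is needed:
var_dump(sendPing('https://example.com/hb/TOKEN/start', 5, fn (string $u) => 'OK'));
```

A missed ping will itself raise an alert on the monitoring side, so failing quietly in the job is usually the safer trade-off.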
Alert Configuration
Proper alert setup is crucial to avoid false alarms while catching real problems:
- Missing startup: The batch was supposed to start at 3:00 AM but no startup ping was received. Possible causes: cron failure, server down, incorrect script permissions.
- Timeout exceeded: The batch started but hasn't completed within the expected time. Possible causes: processing stall, database lock, slow network.
- Failure reported: The batch explicitly reported an error via the fail ping. The alert contains the details sent with the request.
- No completion signal: The batch started (start ping received) but never sent a success or fail ping. Possible causes: crash, infinite loop, memory exhaustion.
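To make the four alert types concrete, here is a minimal sketch of how a monitor might map the signals it has received to one of them. This is an illustration, not MoniTao's actual logic: the function name, the return values and the rule that separates a slow job from a dead one (a progress ping within the last `$silenceThreshold` seconds) are all assumptions.

```php
<?php
// Maps the observed state of a run to an alert type, or null when all is well.
// Times are Unix timestamps; null means "signal not received".
function classifyAlert(
    int $now,
    int $expectedStart,
    int $timeout,
    ?int $startedAt,
    ?int $lastSignalAt,
    ?string $finalStatus,
    int $silenceThreshold = 900
): ?string {
    if ($finalStatus === 'fail') {
        return 'failure_reported';
    }
    if ($finalStatus === 'success') {
        return null;
    }
    if ($startedAt === null) {
        return $now > $expectedStart ? 'missing_startup' : null;
    }
    if ($now - $startedAt <= $timeout) {
        return null; // still within its time budget
    }
    // Past the timeout: a recent progress ping suggests a slow job,
    // prolonged silence suggests a crash or hang.
    $silentFor = $now - ($lastSignalAt ?? $startedAt);
    return $silentFor < $silenceThreshold ? 'timeout_exceeded' : 'no_completion_signal';
}

var_dump(classifyAlert(6000, 1000, 3600, 1000, 5900, null));
```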
Common Use Cases
Nightly Data Import
Import files received via SFTP every night. The heartbeat ensures the import ran, the file was found, and the data was correctly integrated. A single grace period matching the typical import duration is sufficient.
Database Synchronization
Synchronize data between master database and replicas. Heartbeat monitoring detects not only sync failures but also abnormally long durations indicating potential network or performance issues.
Report Generation
Generate and send daily or weekly reports. The heartbeat ensures reports were generated on time. With metadata, you can verify expected file size and correct destination.
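A report check of this kind might look like the sketch below, which verifies that the file exists and is not suspiciously small before the success ping is sent. The function name and the minimum-size rule are assumptions; a real check could also compare row counts or checksums.

```php
<?php
// Returns a list of problems found with a generated report (empty list = OK).
function checkReportFile(string $path, int $minBytes): array
{
    clearstatcache(true, $path); // avoid stale size information
    if (!is_file($path)) {
        return ['file not found'];
    }
    $size = filesize($path);
    if ($size === false) {
        return ['file size unavailable'];
    }
    if ($size < $minBytes) {
        return ["file too small: {$size} bytes, expected at least {$minBytes}"];
    }
    return [];
}

// Demo on a temporary file standing in for the generated report:
$report = tempnam(sys_get_temp_dir(), 'rpt');
file_put_contents($report, str_repeat('x', 50));
var_dump(checkReportFile($report, 10));
unlink($report);
```

If the returned list is not empty, the job can send the fail ping with the problems as the error message instead of reporting success.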
Batch Monitoring Checklist
- Identify all business-critical batch jobs
- Measure actual execution times over one week
- Define acceptable thresholds (duration, errors)
- Implement start and completion pings
- Add progress pings for jobs > 30 minutes
- Configure timeouts with sufficient margin
Frequently Asked Questions
My batch takes 1 to 3 hours depending on the day. What timeout should I set?
Set a 4-hour timeout (maximum observed duration + 33% margin). If the batch takes longer than its historical maximum, that itself is an anomaly worth investigating.
How do I monitor a batch processing millions of rows?
Add progress pings every 100,000 or 500,000 items. This lets you track advancement in real-time and detect stalls before the global timeout triggers.
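One way to express this cadence, as a sketch: ping when either enough items or enough time has passed since the last ping, whichever comes first. The function name and default values are illustrative.

```php
<?php
// Decides whether a progress ping is due: every $everyItems items or
// every $everySeconds seconds, whichever comes first.
function progressPingDue(
    int $itemsDone,
    int $itemsAtLastPing,
    float $now,
    float $lastPingAt,
    int $everyItems = 100000,
    int $everySeconds = 900
): bool {
    return ($itemsDone - $itemsAtLastPing) >= $everyItems
        || ($now - $lastPingAt) >= $everySeconds;
}

// Simulated run over 250,000 items; a real job would send the ping
// where the counter is incremented.
$pings = 0;
$lastItems = 0;
$lastAt = microtime(true);
for ($done = 1; $done <= 250000; $done++) {
    if (progressPingDue($done, $lastItems, microtime(true), $lastAt)) {
        $pings++;
        $lastItems = $done;
        $lastAt = microtime(true);
    }
}
echo $pings, PHP_EOL;
```

The time-based condition matters when items are slow: a job that handles only a few thousand rows per hour still reports regularly.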
My batch sometimes partially fails (some items OK, others fail). How do I handle this?
Define a clear policy: below X% errors the run counts as a success, above it as a failure. When a run fails, send the fail ping with the error count. You can also create two heartbeats with different thresholds (warning vs critical).
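Such a policy could be written as a small function. The thresholds (1% for a warning, 5% for a failure) and the choice to flag an empty run as a warning are assumptions for this sketch; pick values that match your own tolerance.

```php
<?php
// Classifies a run as 'success', 'warning' or 'fail' from its error rate.
// The comparison avoids division so exact thresholds behave predictably.
function classifyBatchResult(int $processed, int $errors, float $warnPercent = 1.0, float $failPercent = 5.0): string
{
    $total = $processed + $errors;
    if ($total === 0) {
        return 'warning'; // assumption: an empty run deserves a look
    }
    if ($errors * 100 >= $failPercent * $total) {
        return 'fail';
    }
    if ($errors * 100 >= $warnPercent * $total) {
        return 'warning';
    }
    return 'success';
}

echo classifyBatchResult(9900, 100), PHP_EOL; // 100 errors out of 10,000 items
```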
My batch can be triggered manually or on schedule. How do I handle both?
Create two distinct heartbeats: one for scheduled executions with strict timeout, one for manual executions with wider timeout. Use a parameter to select the appropriate URL.
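A minimal sketch of that selection, assuming the script is launched with a hypothetical `--trigger=manual` argument for manual runs and with no argument from cron. Both URLs are placeholders.

```php
<?php
// Placeholder URLs: one heartbeat per trigger type.
const HEARTBEAT_URLS = [
    'scheduled' => 'https://example.com/hb/SCHEDULED_TOKEN',
    'manual'    => 'https://example.com/hb/MANUAL_TOKEN',
];

// Picks the heartbeat URL from command-line arguments,
// e.g. `php import.php --trigger=manual`. Defaults to the scheduled one.
function heartbeatUrlFor(array $argv): string
{
    foreach ($argv as $arg) {
        if (preg_match('/^--trigger=(\w+)$/', $arg, $m) && array_key_exists($m[1], HEARTBEAT_URLS)) {
            return HEARTBEAT_URLS[$m[1]];
        }
    }
    return HEARTBEAT_URLS['scheduled'];
}

echo heartbeatUrlFor(['import.php', '--trigger=manual']), PHP_EOL;
```

Defaulting to the scheduled heartbeat means a cron entry needs no extra argument, and an unknown trigger value falls back to the stricter timeout.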
Can I track multiple parallel batches?
Yes, create one heartbeat per independent batch. For parallel batches that are part of the same workflow, consider an orchestrator that sends the final ping once all components complete.
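The orchestrator's decision can be sketched as follows: wait until every component has reported, then send a single final ping. The status values and function name are assumptions chosen for this example.

```php
<?php
// Given the final status of each parallel component ('success', 'fail',
// or null while still running), returns which final ping to send:
// 'complete', 'fail', or null when it is too early to decide.
function workflowFinalPing(array $componentStatuses): ?string
{
    if (in_array(null, $componentStatuses, true)) {
        return null; // at least one component is still running
    }
    return in_array('fail', $componentStatuses, true) ? 'fail' : 'complete';
}

var_dump(workflowFinalPing(['orders' => 'success', 'invoices' => null]));
var_dump(workflowFinalPing(['orders' => 'success', 'invoices' => 'success']));
```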
What if my batch runs on a server without internet access?
Set up a local proxy that receives internal pings and forwards them to MoniTao. Alternatively, write a local log file that another monitored process reads and transmits.
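The log-file variant might be sketched like this: the job appends each ping to a local spool file, and a separate forwarder on a host with internet access reads and transmits them. The file format (one JSON object per line) and both function names are assumptions. Note that this simple version can lose a ping written between the read and the truncate, so a production forwarder would need proper locking or file rotation.

```php
<?php
// Job side: append the ping to a local spool file instead of calling out.
function spoolPing(string $spoolFile, string $event, array $metrics = []): void
{
    $line = json_encode(['event' => $event, 'at' => time(), 'metrics' => $metrics]);
    file_put_contents($spoolFile, $line . PHP_EOL, FILE_APPEND | LOCK_EX);
}

// Forwarder side: hands each spooled ping to $send, then empties the file.
// Returns the number of pings forwarded.
function forwardSpool(string $spoolFile, callable $send): int
{
    if (!is_file($spoolFile)) {
        return 0;
    }
    $lines = file($spoolFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
    foreach ($lines as $line) {
        $send(json_decode($line, true));
    }
    file_put_contents($spoolFile, '', LOCK_EX);
    return count($lines);
}

// Demo with a temporary spool file and a callback that just prints.
$spool = tempnam(sys_get_temp_dir(), 'hb');
spoolPing($spool, 'start');
spoolPing($spool, 'complete', ['processed' => 9900]);
echo forwardSpool($spool, fn (array $ping) => print($ping['event'] . PHP_EOL)), PHP_EOL;
unlink($spool);
```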
Secure Your Batch Processing
Batch jobs are the invisible gears of your business. When they run correctly, nobody notices them. When they fail silently, the consequences can be disastrous: desynchronized data, missing reports, failed imports, frustrated customers.
Heartbeat monitoring with MoniTao transforms every batch job into a supervised process. With a few lines of code, you gain complete visibility: did it start? Is it progressing? Did it complete successfully? If any of these questions has a concerning answer, you're alerted immediately, not after discovering the problem manually.