Monitoring Automations
Effective monitoring ensures your automated tasks run reliably and helps you quickly identify and resolve issues. This guide covers how to track execution history, monitor success rates, set up alerts, and troubleshoot failures.
Execution History
A complete record of all automation runs helps you understand patterns and diagnose issues.
Viewing Execution History
Access the history:
- Navigate to your dashboard or project
- Click on “Automation” in the settings
- Select “Execution History”
- View the log of all runs
History view displays:
```
Job Name      | Type      | Status  | Start Time | Duration | Records | Actions
──────────────────────────────────────────────────────────────────────────────
Daily Sales   | Dashboard | Success | 8:00 AM    | 2m 15s   | 1,234   | [View][Retry]
Weekly Report | Email     | Success | 9:00 AM    | 45s      | N/A     | [View][Resend]
Data Import   | Step      | Failed  | 2:00 AM    | 5m 30s   | 0       | [View][Retry]
Hourly Sync   | Step      | Running | 3:00 PM    | 1m 12s   | 850     | [View][Cancel]
```
Detailed Execution View
Click on any execution to see details:
Overview Section:
```
Job: Daily Sales Dashboard Refresh
Type: Dashboard Automation
Status: Success ✓
Started: Jan 15, 2024 at 8:00:00 AM EST
Completed: Jan 15, 2024 at 8:02:15 AM EST
Duration: 2 minutes 15 seconds
Triggered by: Schedule (cron: 0 8 * * *)
```
Steps Executed:
```
Step 1: Fetch sales data
  Status: Success ✓
  Duration: 45s
  Records: 1,234 rows

Step 2: Calculate metrics
  Status: Success ✓
  Duration: 30s
  Metrics computed: 15

Step 3: Update widgets
  Status: Success ✓
  Duration: 60s
  Widgets updated: 8
```
Resource Usage:
```
Database queries: 12
Query time: 1.2s
Memory peak: 256 MB
CPU time: 45s
```
Output and Logs:
```
[08:00:00] Starting dashboard refresh...
[08:00:01] Connecting to sales database...
[08:00:02] Connection established
[08:00:02] Executing query: SELECT * FROM sales WHERE...
[08:00:45] Retrieved 1,234 records
[08:00:45] Starting metric calculations...
[08:01:15] Metrics calculated successfully
[08:01:15] Updating dashboard widgets...
[08:02:15] Dashboard refresh complete ✓
```
Filtering and Searching
Find specific executions quickly:
Filter by Status:
```
☑ Success
☑ Failed
☑ Running
☐ Cancelled
☐ Skipped
```
Filter by Date Range:
```
Date range: Last 7 days
Custom: Jan 1, 2024 - Jan 15, 2024
Presets: Today, Yesterday, Last 7 days, Last 30 days, This month
```
Filter by Job Type:
```
☑ Dashboard Refresh
☑ Email Report
☑ Data Export
☑ Step Execution
☑ Full Pipeline
```
Search:

Search by:
- Job name
- Error message
- Tag
- Triggered by

Example searches:
```
"sales dashboard"
"connection timeout"
"tag:production"
"triggered:manual"
```
Export History
Export execution history for analysis:
Export formats:
- CSV for spreadsheet analysis
- JSON for programmatic access
- PDF for reports
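A CSV export can be summarized with a few lines of standard-library Python. This is a sketch: the column names (`job_name`, `status`) are assumptions, so adjust them to match the fields in your actual export.

```python
import csv
import io
from collections import Counter

def failure_counts(csv_text):
    """Count failed runs per job from an exported history CSV.

    Assumes 'job_name' and 'status' columns; rename to match
    your export's actual headers.
    """
    failures = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"].lower() == "failed":
            failures[row["job_name"]] += 1
    return failures

# Small inline sample standing in for a real export file
sample = """job_name,status,duration
Hourly Data Sync,Success,45s
Hourly Data Sync,Failed,5m 30s
Daily Sales,Success,2m 15s
Hourly Data Sync,Failed,4m 02s
"""
print(failure_counts(sample))  # Counter({'Hourly Data Sync': 2})
```

The same pattern extends to grouping by error message or computing success rates per job.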
Export options:
```
Date range: Last 30 days
Include: All fields
Filter: Failed executions only
Group by: Job name
Sort by: Start time (descending)
```
Success/Failure Tracking
Monitor the reliability of your automations over time.
Success Rate Metrics
Overall Success Rate:
```
Last 7 days: 98.5% (197/200 runs successful)
Last 30 days: 97.2% (583/600 runs successful)
All time: 96.8% (9,680/10,000 runs successful)
```
By Job:
```
Job Name              | Runs | Success | Failed | Rate
──────────────────────────────────────────────────────
Daily Sales Dashboard |   30 |      30 |      0 | 100%
Weekly Email Report   |    4 |       4 |      0 | 100%
Hourly Data Sync      |  720 |     708 |     12 | 98.3%
Monthly Rollup        |    1 |       0 |      1 | 0%
```
Trends Over Time:
```
Success Rate Trend (Daily)

100% ████████████████████████████████████████
 98% ████████████████████████████░░░█████████
 96% ████████████████████████████░░░█████████
 94% ████████████████████████████░░░█████████
 92% ████████████████████████████░░░█████████
     Mon    Tue    Wed    Thu    Fri    Sat    Sun
```
Note: Drop on Friday due to database maintenance

Performance Metrics
Execution Time Statistics:
```
Job: Daily Sales Dashboard

Average execution time: 2m 15s
Median: 2m 10s
Min: 1m 45s
Max: 5m 30s
Std deviation: 45s
```
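Statistics like these, including the percentiles, can be recomputed offline from raw duration samples with the standard library. The durations below are illustrative values in seconds, not real data.

```python
import statistics

# Illustrative run durations in seconds (e.g., from an exported history CSV)
durations = [135, 130, 105, 330, 128, 142, 118, 160, 125, 270]

mean = statistics.mean(durations)
median = statistics.median(durations)
stdev = statistics.stdev(durations)

# quantiles(n=100) returns 99 cut points; index 49 is p50, 89 is p90, etc.
pct = statistics.quantiles(durations, n=100)
p50, p90, p95, p99 = pct[49], pct[89], pct[94], pct[98]

print(f"mean={mean:.0f}s median={median:.0f}s p90={p90:.0f}s p99={p99:.0f}s")
```

For large histories, the same calculation is usually pushed into the database or a dataframe library instead.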
```
Percentiles:
  p50: 2m 10s (half of runs complete within this time)
  p90: 2m 45s (90% of runs complete within this time)
  p95: 3m 15s
  p99: 4m 30s
```
Execution Time Trends:
```
Avg Duration Trend (Last 30 days)

5m ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█
4m ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█
3m ░░░░░░░░░░░░░░░░░░░░░░░░░█████
2m ████████████████████████████░░
1m ████████████████████████████░░
   Week 1   Week 2   Week 3   Week 4
```
Note: Slowdown in Week 4 due to data growth
Action: Consider optimization or incremental refresh

Data Volume Metrics
Records Processed:
```
Job: Hourly Data Sync

Records per run:
  Average: 1,250
  Median: 1,200
  Min: 850
  Max: 2,500

Total processed (last 30 days): 900,000 records
Growth rate: +5% per week
```
Volume Trends:
```
Records Processed (Last 7 days)

2500 ░░░░░░░░░░░░░░░░░░░░░█
2000 ░░░░░░░░░░░░░░░░░░░░░█
1500 ░░░░░░░█████████████░█
1000 █████████████████████░█
 500 █████████████████████░█
     Mon  Tue  Wed  Thu  Fri  Sat  Sun
```
Note: Spike on Sunday likely due to weekend promotions

Comparative Analysis
Compare Jobs:
```
Metric               | Sales DB | Marketing DB | Product DB
───────────────────────────────────────────────────────────
Success Rate         | 100%     | 95%          | 90%
Avg Duration         | 2m 15s   | 5m 30s       | 8m 45s
Avg Records          | 1,200    | 3,500        | 500
Resource Usage (CPU) | Low      | Medium       | High
```
Compare Time Periods:
```
Metric       | This Week | Last Week | Change
─────────────────────────────────────────────
Success Rate | 98%       | 100%      | -2%
Avg Duration | 2m 30s    | 2m 15s    | +11%
Total Runs   | 168       | 168       | 0
Failed Runs  | 3         | 0         | +3
```
Alerts and Notifications
Stay informed about automation health with proactive alerts.
Alert Types
Failure Alerts:
```
Alert: Immediate notification when job fails
Trigger: Status = Failed
Channel: Email + Slack
Recipients: On-call engineer, team lead
Frequency: Immediate (for each failure)
```
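A failure notification of this kind can be assembled directly from the execution record. The field names in this sketch are illustrative, not a fixed API:

```python
def format_failure_alert(run):
    """Build a subject and body for a failure notification.

    The keys of `run` are illustrative; map them to whatever your
    execution records actually contain.
    """
    subject = f"[ALERT] {run['job_name']} Failed"
    body = (
        f"The {run['job_name']} job failed at {run['failed_at']}.\n"
        f"Error: {run['error']}\n"
        f"Retry: Will retry in {run['retry_delay']} "
        f"(attempt {run['attempt']} of {run['max_attempts']})"
    )
    return subject, body

subject, body = format_failure_alert({
    "job_name": "Hourly Data Sync",
    "failed_at": "3:00 PM EST",
    "error": "Connection timeout to data source",
    "retry_delay": "5 minutes",
    "attempt": 1,
    "max_attempts": 3,
})
print(subject)  # [ALERT] Hourly Data Sync Failed
```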
Example:
```
Subject: [ALERT] Hourly Data Sync Failed
Message: The Hourly Data Sync job failed at 3:00 PM EST.
Error: Connection timeout to data source
Retry: Will retry in 5 minutes (attempt 1 of 3)
```
Performance Alerts:
```
Alert: Job taking longer than expected
Trigger: Duration > p95 (95th percentile)
Channel: Email
Recipients: Engineering team
Frequency: Once per day (digest)
```
Example:
```
Subject: [WARNING] Daily Sales Dashboard Running Slow
Message: Today's dashboard refresh took 4m 30s, compared to an
average of 2m 15s. This may indicate data growth or performance
degradation.
```
Threshold Alerts:
```
Alert: Success rate drops below threshold
Trigger: Success rate < 95% (7-day window)
Channel: Email + PagerDuty
Recipients: Engineering manager, DevOps
Frequency: Once when threshold crossed, daily while below
```
Example:
```
Subject: [CRITICAL] Automation Success Rate Below Threshold
Message: The 7-day success rate has dropped to 92.5%, below the
95% threshold. 15 of 200 runs have failed in the past week.
Primary cause: Database connection timeouts (60% of failures)
```
Anomaly Alerts:
```
Alert: Unusual pattern detected
Trigger: AI-detected anomaly
Channel: Email
Recipients: Data team
Frequency: When anomaly confidence > 90%
```
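The trigger above is AI-detected, but as a rough stand-in, a simple z-score check against recent history flags the same kind of outlier. This is a sketch with illustrative sample values, not the product's actual detection logic:

```python
import statistics

def is_anomalous(value, history, threshold=3.0):
    """Flag a value whose z-score against recent history exceeds
    `threshold` standard deviations. A crude stand-in for the
    AI-based anomaly detection described above."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# Illustrative recent record counts per sync run
recent = [1200, 1150, 1300, 1250, 1100, 1400, 1350]
print(is_anomalous(5200, recent))  # True: far above the typical 1,000-1,500 range
print(is_anomalous(1280, recent))  # False: within normal variation
```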
Example:
```
Subject: [ANOMALY] Unusual Record Count in Data Sync
Message: Today's data sync processed 5,200 records, significantly
higher than the typical range of 1,000-1,500. This may indicate:
- Data quality issue (duplicates)
- Unexpected increase in activity
- System error
Please investigate.
```
Notification Channels
Email:
```
To: engineering@company.com
CC: manager@company.com
Subject template: [${severity}] ${job_name} ${status}
Include: Error details, logs, retry information
Attach: Execution summary report
```
Slack:
```
Channel: #data-alerts
Mention: @on-call-engineer (for critical alerts)
Format: Rich message with action buttons
Thread: Group related alerts in threads
```
Example message:
```
🔴 ALERT: Hourly Data Sync Failed
Job: Hourly Data Sync
Time: 3:00 PM EST
Error: Connection timeout
Retry: In 5 minutes (1/3)
[View Logs] [Retry Now] [Disable Job]
```
Microsoft Teams:
```
Team: Engineering
Channel: Automation Alerts
Format: Adaptive card
Include: Quick actions
```
SMS (for critical alerts only):
```
Recipients: On-call phone numbers
Triggers: Critical failures only
Rate limit: Max 3 SMS per hour
```
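A rate limit like the one above (max 3 SMS per hour) can be enforced with a small sliding-window limiter; a minimal sketch:

```python
from collections import deque

class SmsRateLimiter:
    """Sliding-window limiter: allow at most `limit` messages
    per `window` seconds."""

    def __init__(self, limit=3, window=3600):
        self.limit = limit
        self.window = window
        self.sent = deque()  # timestamps of messages sent in the window

    def allow(self, now):
        # Drop timestamps that have aged out of the window
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            self.sent.append(now)
            return True
        return False

limiter = SmsRateLimiter(limit=3, window=3600)
# Four alerts in quick succession, then one an hour later
results = [limiter.allow(t) for t in (0, 60, 120, 180, 3700)]
print(results)  # [True, True, True, False, True]
```

In production you would pass `time.time()` as `now`; taking it as a parameter keeps the limiter easy to test.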
Example:
```
CRITICAL: Daily Sales Dashboard failed after 3 retries at 8:15 AM.
Dashboards not updated. View: https://app.querri.com/jobs/12345
```
Webhook (for custom integrations):
```
URL: https://your-system.com/webhook
Method: POST
Headers:
  Authorization: Bearer ${token}
Payload:
{
  "job_id": "12345",
  "job_name": "Daily Sales Dashboard",
  "status": "failed",
  "timestamp": "2024-01-15T08:15:00Z",
  "error": "Connection timeout",
  "retry_count": 3,
  "logs_url": "https://app.querri.com/logs/12345"
}
```
Alert Configuration
Severity Levels:
Info: FYI, no action needed
- Job completed successfully after retry
- Longer than average execution time
- Daily/weekly digest summaries

Warning: Monitor the situation
- Single failure (will retry)
- Performance degradation
- Data volume anomaly

Error: Action recommended
- Multiple consecutive failures
- Success rate below threshold
- High resource usage

Critical: Immediate action required
- All retries exhausted
- Critical pipeline failed
- Data integrity issue
- Scheduled report not sent

Alert Schedules:
Immediate: Send as soon as the condition is met
- Critical failures
- Security issues

Batched: Collect and send periodically
- Performance warnings (hourly digest)
- Info messages (daily digest)

Throttled: Limit frequency to prevent spam
- Max 1 alert per 15 minutes for the same job
- Combine repeated failures into a single alert
- Escalate after threshold (3 alerts → page manager)

Alert Rules:
```
Rule: Failed Job Alert
When: Job status = Failed
And: Retry count = Max retries
Send: Immediate
Channel: Email + Slack
Severity: Error
```
```
Rule: Performance Degradation
When: Avg duration (7 days) > p95 (30 days)
And: Trend is increasing
Send: Daily digest
Channel: Email
Severity: Warning
```
```
Rule: Success Rate Alert
When: Success rate (7 days) < 95%
Send: Once when crossed + daily while below
Channel: Email + PagerDuty
Severity: Critical
```
Troubleshooting Failed Runs
When an automation fails, systematic troubleshooting helps resolve issues quickly.
Diagnosing Failures
Step 1: Check the Error Message
Common error patterns:
```
"Connection timeout"
→ Network issue or database overload
→ Check: Database status, network connectivity, firewall rules

"Authentication failed"
→ Credentials expired or incorrect
→ Check: API keys, database passwords, token expiration

"Query returned no results"
→ Data source empty or filter too restrictive
→ Check: Data source, query filters, date ranges

"Out of memory"
→ Processing too much data at once
→ Check: Record count, data size, memory limits

"Rate limit exceeded"
→ Too many requests to API
→ Check: Request frequency, API quotas, throttling
```
Step 2: Review Execution Logs
Look for:
- Last successful step before failure
- Warning messages before the error
- Resource usage spikes
- Timing of failure (always the same time? random?)
Example log analysis:
```
[08:00:00] Starting dashboard refresh...
[08:00:01] Connecting to database...
[08:00:02] Connection established
[08:00:02] Executing query...
[08:00:45] Warning: Query taking longer than usual
[08:01:30] Warning: High memory usage (512 MB)
[08:02:15] Error: Out of memory  ← Failure point
```
Diagnosis: Query returning too much data, causing OOM
Solution: Add LIMIT clause or use incremental refresh

Step 3: Compare with Successful Runs
What's different about failed runs?
Successful run (yesterday):
- Duration: 2m 15s
- Records: 1,200
- Memory: 128 MB

Failed run (today):
- Duration: Failed at 2m 15s
- Records: 5,800 (4.8x higher!)
- Memory: 512 MB → OOM

Root cause: Unexpected spike in data volume

Common Issues and Solutions
Connection Issues:
Problem: "Connection timeout to database"
Troubleshooting:
1. Check if the database is running → Try connecting manually
2. Verify network connectivity → Ping the database server
3. Check firewall rules → Ensure Querri's IP is whitelisted
4. Verify credentials → Test with a database client
5. Check the connection pool → May be exhausted during peak times
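Steps 1 and 2 of this checklist can be scripted as a quick TCP reachability probe. The host and port below are placeholders; substitute your database's address:

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Attempt a TCP connection with a short timeout.

    Returns False on timeout, refusal, or unreachable network,
    which covers the most common connectivity failures."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 192.0.2.1 is a reserved documentation address that never responds,
# standing in for an unreachable database host
print(can_reach("192.0.2.1", 5432, timeout=0.5))  # False
```

A probe like this distinguishes network problems from authentication or query problems before you dig into credentials or SQL.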
Solutions:
- Increase the connection timeout
- Add the database to the firewall whitelist
- Rotate credentials if expired
- Schedule during off-peak hours
- Increase the connection pool size

Query Errors:
Problem: "Invalid column name 'customer_id'"
Troubleshooting:
1. Check if the schema changed → Compare with previous runs
2. Verify query syntax → Test the query in a SQL client
3. Check that the table/view exists → List tables in the database
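A lightweight pre-flight schema check catches renamed or dropped columns before the query runs. This is a sketch; the column names are illustrative:

```python
def missing_columns(expected, actual):
    """Return expected columns absent from the table, so the job can
    fail fast with a clear message instead of a query error."""
    return sorted(set(expected) - set(actual))

expected = ["customer_id", "order_date", "total"]
actual = ["customer_key", "order_date", "total"]  # column renamed upstream
print(missing_columns(expected, actual))  # ['customer_id']
```

In practice `actual` would come from the database's information schema (e.g., `information_schema.columns`).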
Solutions:
- Update the query to use the correct column name
- Add a schema validation step
- Use aliases if the table structure changed
- Set up schema change alerts

Data Quality Issues:
Problem: "Data validation failed: null values in required field"
Troubleshooting:
1. Check the data source → Verify the upstream process completed
2. Review data quality → Sample recent records
3. Check the data pipeline → Verify transformation steps

Solutions:
- Add data quality checks earlier in the pipeline
- Set up alerts for upstream failures
- Use default values for missing data
- Add a data cleaning step

Resource Limitations:
Problem: "Job exceeded maximum execution time (10 minutes)"
Troubleshooting:
1. Identify slow steps → Review the execution timeline
2. Check data volume → Compare with historical runs
3. Analyze query performance → Check execution plans

Solutions:
- Optimize slow queries
- Add database indexes
- Use incremental instead of full refresh
- Increase the timeout limit
- Split into multiple smaller jobs

Retry Strategies
When to Retry Immediately:
Transient errors that may resolve quickly:
- Network timeout
- Database temporarily unavailable
- Rate limit (with backoff)
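A retry wrapper for such transient errors, with exponential backoff matching the configuration in this section, might look like the sketch below. Treating `TimeoutError` and `ConnectionError` as the transient classes is an assumption; permanent errors (bad credentials, query syntax) propagate immediately instead of being retried:

```python
import time

# Assumed transient error classes; anything else is treated as permanent
TRANSIENT = (TimeoutError, ConnectionError)

def run_with_backoff(task, max_attempts=3, base_delay=60):
    """Retry `task` on transient errors, doubling the delay each time
    (1min, 2min, ... with the default base). Permanent errors and the
    final transient failure are re-raised for alerting."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TRANSIENT:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Example: a task that fails twice with a transient error, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("connection timeout")
    return "ok"

print(run_with_backoff(flaky, base_delay=0.01))  # 'ok' on the third attempt
```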
Configuration:
```
Retry: Immediately with exponential backoff
Max attempts: 3
Delay: 1min, 2min, 4min
```
When to Wait Before Retry:
Issues that need time to resolve:
- Downstream service maintenance
- Scheduled database backup
- Known outage window
Configuration:
```
Retry: After maintenance window
Max attempts: 1
Delay: Until 10:00 AM (after backup completes)
```
When Not to Retry:
Permanent errors that won't resolve automatically:
- Invalid credentials
- Syntax errors in the query
- Schema mismatch
- Permission denied
Action:
- Send an alert immediately
- Require a manual fix
- Don't retry automatically

Manual Intervention
Fixing and Rerunning:
Steps:
1. Identify the root cause from logs
2. Fix the underlying issue
   - Update credentials
   - Optimize the query
   - Increase resources
3. Test the fix manually
   - Run the job manually
   - Verify it succeeds
4. Re-enable automation
5. Monitor the next few runs closely

Partial Recovery:
If the job partially completed:
1. Identify the last successful checkpoint
2. Note which data was processed
3. Modify the job to resume from the checkpoint
4. Run manually to complete
5. Verify data integrity
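The resume-from-checkpoint idea can be sketched in a few lines. The record shape and the id-based checkpoint here are illustrative; any monotonically increasing key works:

```python
def process_with_checkpoint(records, checkpoint, handle):
    """Skip records at or below the saved checkpoint id, process the
    rest, and return the new checkpoint to persist for the next run."""
    last_done = checkpoint
    for rec in records:
        if rec["id"] <= last_done:
            continue  # already processed before the failure
        handle(rec)
        last_done = rec["id"]
    return last_done

processed = []
records = [{"id": i} for i in range(1, 11)]
# Previous run failed after record 7; resume from there
new_checkpoint = process_with_checkpoint(records, checkpoint=7,
                                         handle=processed.append)
print(new_checkpoint, len(processed))  # 10 3
```

Persisting `new_checkpoint` after every batch (not just at the end) keeps the amount of rework small if the job fails again.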
Example:
```
Job: Load 100,000 records
Failed at: Record 75,432
Recovery:
- Add WHERE id > 75432 to the query
- Run manually to process the remaining records
- Verify total count = 100,000
- Re-enable normal automation
```
Monitoring Dashboard
Create a centralized view of automation health:
Key Widgets
Overall Health:
```
┌─────────────────────────────┐
│ Automation Health           │
│                             │
│ Success Rate (7d): 98.5% ✓  │
│ Active Jobs: 25             │
│ Running Now: 3              │
│ Failed Today: 1             │
└─────────────────────────────┘
```
Recent Failures:
```
┌─────────────────────────────┐
│ Recent Failures             │
│                             │
│ • Hourly Sync - 3:00 PM     │
│   Connection timeout        │
│ • Monthly Report - 9:00 AM  │
│   Query error               │
└─────────────────────────────┘
```
Execution Timeline:
```
Timeline (Last 24 hours)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓✓✓✓✗✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓
00:00         12:00         24:00

✓ = Success, ✗ = Failed
```
Performance Trends:
```
Avg Execution Time (30 days)

3m ██░░░░░░░░░░░░░░░░░░░░
2m ████████████████████░░
1m ████████████████████░░
   Week1  Week2  Week3  Week4
```
Best Practices
- Set up alerts early: Don’t wait for problems to occur
- Monitor trends: Look for gradual degradation, not just failures
- Review logs regularly: Weekly review of execution history
- Test alert delivery: Ensure notifications reach the right people
- Document common issues: Build a troubleshooting playbook
- Automate recovery when possible: Use retry logic and fallbacks
- Track changes: Log all modifications to automations
- Regular audits: Monthly review of all active automations
- Clean up old jobs: Remove obsolete automations
- Share learnings: Document and share solutions with team
Next Steps
- Scheduling Basics - Optimize your automation schedules
- Automated Reports - Monitor email report delivery
- Recurring Analysis - Track data pipeline health
Build reliable, monitored automation that you can trust!