
Monitoring Automations

Effective monitoring ensures your automated tasks run reliably and helps you quickly identify and resolve issues. This guide covers how to track execution history, monitor success rates, set up alerts, and troubleshoot failures.

The execution history gives you a complete record of all automation runs, helping you spot patterns and diagnose issues.

Access the history:

  1. Navigate to your dashboard or project
  2. Click on “Automation” in the settings
  3. Select “Execution History”
  4. View the log of all runs

History view displays:

Job Name | Type | Status | Start Time | Duration | Records | Actions
──────────────────────────────────────────────────────────────────────
Daily Sales | Dashboard | Success | 8:00 AM | 2m 15s | 1,234 | [View][Retry]
Weekly Report | Email | Success | 9:00 AM | 45s | N/A | [View][Resend]
Data Import | Step | Failed | 2:00 AM | 5m 30s | 0 | [View][Retry]
Hourly Sync | Step | Running | 3:00 PM | 1m 12s | 850 | [View][Cancel]

Click on any execution to see details:

Overview Section:

Job: Daily Sales Dashboard Refresh
Type: Dashboard Automation
Status: Success ✓
Started: Jan 15, 2024 at 8:00:00 AM EST
Completed: Jan 15, 2024 at 8:02:15 AM EST
Duration: 2 minutes 15 seconds
Triggered by: Schedule (cron: 0 8 * * *)

Steps Executed:

Step 1: Fetch sales data
Status: Success ✓
Duration: 45s
Records: 1,234 rows
Step 2: Calculate metrics
Status: Success ✓
Duration: 30s
Metrics computed: 15
Step 3: Update widgets
Status: Success ✓
Duration: 60s
Widgets updated: 8

Resource Usage:

Database queries: 12
Query time: 1.2s
Memory peak: 256 MB
CPU time: 45s

Output and Logs:

[08:00:00] Starting dashboard refresh...
[08:00:01] Connecting to sales database...
[08:00:02] Connection established
[08:00:02] Executing query: SELECT * FROM sales WHERE...
[08:00:45] Retrieved 1,234 records
[08:00:45] Starting metric calculations...
[08:01:15] Metrics calculated successfully
[08:01:15] Updating dashboard widgets...
[08:02:15] Dashboard refresh complete ✓

Find specific executions quickly:

Filter by Status:

☑ Success
☑ Failed
☑ Running
☐ Cancelled
☐ Skipped

Filter by Date Range:

Date range: Last 7 days
Custom: Jan 1, 2024 - Jan 15, 2024
Presets: Today, Yesterday, Last 7 days, Last 30 days, This month

Filter by Job Type:

☑ Dashboard Refresh
☑ Email Report
☑ Data Export
☑ Step Execution
☑ Full Pipeline

Search:

Search by:
- Job name
- Error message
- Tag
- Triggered by
Example searches:
"sales dashboard"
"connection timeout"
"tag:production"
"triggered:manual"

Export execution history for analysis:

Export formats:

  • CSV for spreadsheet analysis
  • JSON for programmatic access
  • PDF for reports

Export options:

Date range: Last 30 days
Include: All fields
Filter: Failed executions only
Group by: Job name
Sort by: Start time (descending)
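
Once exported, the history can be analyzed with standard tools. Below is a minimal Python sketch using pandas; the file name and the column names (job_name, status, start_time) are assumptions for illustration, not a documented export schema.

import pandas as pd

# File name and column names are assumptions, not a documented export schema.
history = pd.read_csv("execution_history.csv", parse_dates=["start_time"])

# Keep only runs from the last 30 days, then only the failures.
cutoff = history["start_time"].max() - pd.Timedelta(days=30)
failed = history[(history["start_time"] >= cutoff) & (history["status"] == "Failed")]

# Count failures per job, most frequent first.
print(failed.groupby("job_name").size().sort_values(ascending=False))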

Monitor the reliability of your automations over time.

Overall Success Rate:

Last 7 days: 98.5% (197/200 runs successful)
Last 30 days: 97.2% (583/600 runs successful)
All time: 96.8% (9,680/10,000 runs successful)

By Job:

Job Name | Runs | Success | Failed | Rate
──────────────────────────────────────────────────────
Daily Sales Dashboard | 30 | 30 | 0 | 100%
Weekly Email Report | 4 | 4 | 0 | 100%
Hourly Data Sync | 720 | 708 | 12 | 98.3%
Monthly Rollup | 1 | 0 | 1 | 0%

Trends Over Time:

Success Rate Trend (Daily)
100% ████████████████████████████████████████
98% ████████████████████████████░░░█████████
96% ████████████████████████████░░░█████████
94% ████████████████████████████░░░█████████
92% ████████████████████████████░░░█████████
Mon Tue Wed Thu Fri Sat Sun
Note: Drop on Friday due to database maintenance

Execution Time Statistics:

Job: Daily Sales Dashboard
Average execution time: 2m 15s
Median: 2m 10s
Min: 1m 45s
Max: 5m 30s
Std deviation: 45s
Percentiles:
p50: 2m 10s (half of runs complete within this time)
p90: 2m 45s (90% of runs complete within this time)
p95: 3m 15s
p99: 4m 30s
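
These statistics are easy to reproduce from exported run durations. A minimal sketch using Python's standard library, with illustrative duration values:

import statistics

# Run durations in seconds (illustrative values, not real export data).
durations = [130, 125, 135, 140, 128, 132, 195, 127, 133, 330, 129, 131]

average = statistics.mean(durations)
median = statistics.median(durations)
spread = statistics.stdev(durations)

# quantiles(n=100) returns the 99 cut points between percentiles.
cuts = statistics.quantiles(durations, n=100, method="inclusive")
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]

print(f"avg={average:.0f}s  median={median:.0f}s  stdev={spread:.0f}s")
print(f"p50={p50:.0f}s  p90={p90:.0f}s  p95={p95:.0f}s  p99={p99:.0f}s")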

Execution Time Trends:

Avg Duration Trend (Last 30 days)
5m ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█
4m ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█
3m ░░░░░░░░░░░░░░░░░░░░░░░░░█████
2m ████████████████████████████░░
1m ████████████████████████████░░
Week 1 Week 2 Week 3 Week 4
Note: Slowdown in Week 4 due to data growth
Action: Consider optimization or incremental refresh

Records Processed:

Job: Hourly Data Sync
Records per run:
Average: 1,250
Median: 1,200
Min: 850
Max: 2,500
Total processed (last 30 days): 900,000 records
Growth rate: +5% per week

Volume Trends:

Records Processed (Last 7 days)
2500 ░░░░░░░░░░░░░░░░░░░░░█
2000 ░░░░░░░░░░░░░░░░░░░░░█
1500 ░░░░░░░█████████████░█
1000 █████████████████████░█
500 █████████████████████░█
Mon Tue Wed Thu Fri Sat Sun
Note: Spike on Sunday likely due to weekend promotions

Compare Jobs:

Metric | Sales DB | Marketing DB | Product DB
───────────────────────────────────────────────────────────
Success Rate | 100% | 95% | 90%
Avg Duration | 2m 15s | 5m 30s | 8m 45s
Avg Records | 1,200 | 3,500 | 500
Resource Usage (CPU) | Low | Medium | High

Compare Time Periods:

Metric | This Week | Last Week | Change
─────────────────────────────────────────────────
Success Rate | 98% | 100% | -2%
Avg Duration | 2m 30s | 2m 15s | +11%
Total Runs | 168 | 168 | 0
Failed Runs | 3 | 0 | +3

Stay informed about automation health with proactive alerts.

Failure Alerts:

Alert: Immediate notification when job fails
Trigger: Status = Failed
Channel: Email + Slack
Recipients: On-call engineer, team lead
Frequency: Immediate (for each failure)
Example:
Subject: [ALERT] Hourly Data Sync Failed
Message: The Hourly Data Sync job failed at 3:00 PM EST.
Error: Connection timeout to data source
Retry: Will retry in 5 minutes (attempt 1 of 3)

Performance Alerts:

Alert: Job taking longer than expected
Trigger: Duration > p95 (95th percentile)
Channel: Email
Recipients: Engineering team
Frequency: Once per day (digest)
Example:
Subject: [WARNING] Daily Sales Dashboard Running Slow
Message: Today's dashboard refresh took 4m 30s, compared to
average of 2m 15s. This may indicate data growth or performance
degradation.

Threshold Alerts:

Alert: Success rate drops below threshold
Trigger: Success rate < 95% (7-day window)
Channel: Email + PagerDuty
Recipients: Engineering manager, DevOps
Frequency: Once when threshold crossed, daily while below
Example:
Subject: [CRITICAL] Automation Success Rate Below Threshold
Message: The 7-day success rate has dropped to 92.5%, below the
95% threshold. 15 of 200 runs have failed in the past week.
Primary cause: Database connection timeouts (60% of failures)
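
The underlying check is simply a success rate computed over a sliding 7-day window and compared against the threshold. A minimal sketch with illustrative run records:

from datetime import datetime, timedelta

# Each entry is (run timestamp, succeeded); illustrative data, not a real export.
runs = [
    (datetime(2024, 1, 15, 8, 0), True),
    (datetime(2024, 1, 14, 8, 0), False),
    (datetime(2024, 1, 12, 8, 0), True),
    (datetime(2024, 1, 7, 8, 0), True),   # falls outside the 7-day window below
]

THRESHOLD = 0.95
window_start = datetime(2024, 1, 15) - timedelta(days=7)

recent = [ok for ts, ok in runs if ts >= window_start]
rate = sum(recent) / len(recent)

if rate < THRESHOLD:
    print(f"[CRITICAL] 7-day success rate {rate:.1%} is below the {THRESHOLD:.0%} threshold")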

Anomaly Alerts:

Alert: Unusual pattern detected
Trigger: AI-detected anomaly
Channel: Email
Recipients: Data team
Frequency: When anomaly confidence > 90%
Example:
Subject: [ANOMALY] Unusual Record Count in Data Sync
Message: Today's data sync processed 5,200 records, significantly
higher than the typical range of 1,000-1,500. This may indicate:
- Data quality issue (duplicates)
- Unexpected increase in activity
- System error
Please investigate.
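
The anomaly detection described here is AI-based; as a rough illustration of the underlying idea only, a check against the recent mean and standard deviation looks like this (a simplified stand-in, not the actual detection logic):

import statistics

# Record counts from recent syncs (illustrative history).
recent_counts = [1200, 1150, 1300, 1250, 1100, 1400, 1350]
today = 5200

mean = statistics.mean(recent_counts)
spread = statistics.stdev(recent_counts)

# Flag anything more than three standard deviations from the recent mean.
if abs(today - mean) > 3 * spread:
    print(f"[ANOMALY] {today} records vs typical {mean:.0f} ± {spread:.0f}")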

Alerts can be delivered through several notification channels.

Email:

Configuration:
To: engineering@company.com
CC: manager@company.com
Subject template: [${severity}] ${job_name} ${status}
Include: Error details, logs, retry information
Attach: Execution summary report

Slack:

Configuration:
Channel: #data-alerts
Mention: @on-call-engineer (for critical alerts)
Format: Rich message with action buttons
Thread: Group related alerts in threads
Example message:
🔴 ALERT: Hourly Data Sync Failed
Job: Hourly Data Sync
Time: 3:00 PM EST
Error: Connection timeout
Retry: In 5 minutes (1/3)
[View Logs] [Retry Now] [Disable Job]
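
A message like the one above can be posted through a Slack incoming webhook. A minimal sketch (the webhook URL is a placeholder created in your Slack workspace; the action buttons shown above require Slack's Block Kit and are omitted here):

import requests

# The incoming webhook URL is a placeholder; create one in your Slack workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

message = (
    ":red_circle: ALERT: Hourly Data Sync Failed\n"
    "Job: Hourly Data Sync\n"
    "Time: 3:00 PM EST\n"
    "Error: Connection timeout\n"
    "Retry: In 5 minutes (1/3)"
)

# Slack incoming webhooks accept a JSON body with a "text" field.
response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
response.raise_for_status()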

Microsoft Teams:

Configuration:
Team: Engineering
Channel: Automation Alerts
Format: Adaptive card
Include: Quick actions

SMS (for critical alerts only):

Configuration:
Recipients: On-call phone numbers
Triggers: Critical failures only
Rate limit: Max 3 SMS per hour
Example:
CRITICAL: Daily Sales Dashboard failed after 3 retries at 8:15 AM.
Dashboards not updated. View: https://app.querri.com/jobs/12345

Webhook (for custom integrations):

Configuration:
URL: https://your-system.com/webhook
Method: POST
Headers: Authorization: Bearer ${token}
Payload:
{
  "job_id": "12345",
  "job_name": "Daily Sales Dashboard",
  "status": "failed",
  "timestamp": "2024-01-15T08:15:00Z",
  "error": "Connection timeout",
  "retry_count": 3,
  "logs_url": "https://app.querri.com/logs/12345"
}
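
On the receiving side, your endpoint should verify the bearer token and parse the payload. A minimal Flask sketch of such a receiver (the route, token handling, and failure handling are assumptions about your own service, not part of the product):

from flask import Flask, request, jsonify

app = Flask(__name__)
EXPECTED_TOKEN = "replace-with-your-shared-secret"  # assumed simple shared secret

@app.route("/webhook", methods=["POST"])
def automation_webhook():
    # Verify the bearer token sent in the Authorization header.
    auth = request.headers.get("Authorization", "")
    if auth != f"Bearer {EXPECTED_TOKEN}":
        return jsonify({"error": "unauthorized"}), 401

    event = request.get_json(force=True)
    if event.get("status") == "failed":
        # Route failures to your own alerting or ticketing system here.
        print(f"Job {event['job_name']} failed: {event.get('error')} "
              f"(logs: {event.get('logs_url')})")
    return jsonify({"received": True}), 200

if __name__ == "__main__":
    app.run(port=8080)

Respond quickly with a 2xx status so the sender does not treat your endpoint itself as a failure.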

Severity Levels:

Info: FYI, no action needed
- Job completed successfully after retry
- Longer than average execution time
- Daily/weekly digest summaries
Warning: Monitor situation
- Single failure (will retry)
- Performance degradation
- Data volume anomaly
Error: Action recommended
- Multiple consecutive failures
- Success rate below threshold
- Resource usage high
Critical: Immediate action required
- All retries exhausted
- Critical pipeline failed
- Data integrity issue
- Scheduled report not sent

Alert Schedules:

Immediate: Send as soon as condition is met
- Critical failures
- Security issues
Batched: Collect and send periodically
- Performance warnings (hourly digest)
- Info messages (daily digest)
Throttled: Limit frequency to prevent spam
- Max 1 alert per 15 minutes for same job
- Combine repeated failures into single alert
- Escalate after threshold (3 alerts → page manager)
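
Throttling amounts to remembering when each job last alerted and counting what was suppressed in between. A minimal sketch of that bookkeeping (send_alert and page_manager are placeholders for your own delivery and escalation paths):

import time

THROTTLE_SECONDS = 15 * 60   # max one alert per 15 minutes per job
ESCALATE_AFTER = 3           # escalate once this many alerts accumulate

def send_alert(message: str) -> None:
    print("ALERT:", message)          # placeholder: your email/Slack delivery

def page_manager(job_name: str) -> None:
    print("ESCALATE:", job_name)      # placeholder: your escalation path

_last_sent = {}    # job name -> timestamp of the last alert actually sent
_suppressed = {}   # job name -> alerts swallowed since that send

def maybe_alert(job_name: str, message: str) -> None:
    now = time.time()
    if now - _last_sent.get(job_name, 0) < THROTTLE_SECONDS:
        _suppressed[job_name] = _suppressed.get(job_name, 0) + 1
        return
    suppressed = _suppressed.pop(job_name, 0)
    if suppressed:
        message += f" ({suppressed} similar alerts suppressed)"
    _last_sent[job_name] = now
    send_alert(message)
    if suppressed + 1 >= ESCALATE_AFTER:
        page_manager(job_name)

Folding the suppressed count into the next alert keeps the signal without the spam.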

Alert Rules:

Rule: Failed Job Alert
When: Job status = Failed
And: Retry count = Max retries
Send: Immediate
Channel: Email + Slack
Severity: Error
Rule: Performance Degradation
When: Avg duration (7 days) > p95 (30 days)
And: Trend is increasing
Send: Daily digest
Channel: Email
Severity: Warning
Rule: Success Rate Alert
When: Success rate (7 days) < 95%
Send: Once when crossed + daily while below
Channel: Email + PagerDuty
Severity: Critical

When automation fails, systematic troubleshooting helps resolve issues quickly.

Step 1: Check the Error Message

Common error patterns:
"Connection timeout"
→ Network issue or database overload
→ Check: Database status, network connectivity, firewall rules
"Authentication failed"
→ Credentials expired or incorrect
→ Check: API keys, database passwords, token expiration
"Query returned no results"
→ Data source empty or filter too restrictive
→ Check: Data source, query filters, date ranges
"Out of memory"
→ Processing too much data at once
→ Check: Record count, data size, memory limits
"Rate limit exceeded"
→ Too many requests to API
→ Check: Request frequency, API quotas, throttling

Step 2: Review Execution Logs

Look for:
- Last successful step before failure
- Warning messages before error
- Resource usage spikes
- Timing of failure (always same time? random?)
Example log analysis:
[08:00:00] Starting dashboard refresh...
[08:00:01] Connecting to database...
[08:00:02] Connection established
[08:00:02] Executing query...
[08:00:45] Warning: Query taking longer than usual
[08:01:30] Warning: High memory usage (512 MB)
[08:02:15] Error: Out of memory ← Failure point
Diagnosis: Query returning too much data, causing OOM
Solution: Add LIMIT clause or use incremental refresh

Step 3: Compare with Successful Runs

What's different about failed runs?
Successful run (yesterday):
- Duration: 2m 15s
- Records: 1,200
- Memory: 128 MB
Failed run (today):
- Duration: Failed at 2m 15s
- Records: 5,800 (4.8x higher!)
- Memory: 512 MB → OOM
Root cause: Unexpected spike in data volume

Connection Issues:

Problem: "Connection timeout to database"
Troubleshooting:
1. Check if database is running
→ Try connecting manually
2. Verify network connectivity
→ Ping database server
3. Check firewall rules
→ Ensure Querri IP is whitelisted
4. Verify credentials
→ Test with database client
5. Check connection pool
→ May be exhausted during peak times
Solutions:
- Increase connection timeout
- Add database to firewall whitelist
- Rotate credentials if expired
- Schedule during off-peak hours
- Increase connection pool size

Query Errors:

Problem: "Invalid column name 'customer_id'"
Troubleshooting:
1. Check if schema changed
→ Compare with previous runs
2. Verify query syntax
→ Test query in SQL client
3. Check table/view exists
→ List tables in database
Solutions:
- Update query to use correct column name
- Add schema validation step
- Use aliases if table structure changed
- Set up schema change alerts
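
One way to implement the schema validation step is to confirm that the columns the job depends on still exist before the main query runs. A minimal sketch using an in-memory SQLite table as a stand-in for the real data source:

import sqlite3

REQUIRED_COLUMNS = {"customer_id", "amount", "order_date"}

# An in-memory SQLite table stands in for the real data source in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL, order_date TEXT)")

def validate_schema(connection: sqlite3.Connection, table: str) -> None:
    # A zero-row query still populates cursor.description with column names.
    cursor = connection.execute(f"SELECT * FROM {table} LIMIT 0")
    actual = {column[0] for column in cursor.description}
    missing = REQUIRED_COLUMNS - actual
    if missing:
        raise RuntimeError(f"Schema check failed: missing columns {sorted(missing)}")

validate_schema(conn, "sales")  # raises before the main query if the schema changed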

Data Quality Issues:

Problem: "Data validation failed: null values in required field"
Troubleshooting:
1. Check data source
→ Verify upstream process completed
2. Review data quality
→ Sample recent records
3. Check data pipeline
→ Verify transformation steps
Solutions:
- Add data quality checks earlier in pipeline
- Set up alerts for upstream failures
- Use default values for missing data
- Add data cleaning step
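
A basic data quality check can run before the load and fail fast when required fields are null. A minimal sketch over in-memory records (the field names are illustrative):

REQUIRED_FIELDS = ["customer_id", "amount"]

# Illustrative records; in practice these come from the upstream source.
records = [
    {"customer_id": 1, "amount": 19.99},
    {"customer_id": None, "amount": 5.00},   # fails validation
    {"customer_id": 3, "amount": None},      # fails validation
]

problems = [
    (index, field)
    for index, record in enumerate(records)
    for field in REQUIRED_FIELDS
    if record.get(field) is None
]

if problems:
    raise ValueError(f"Data validation failed: null values in required fields: {problems}")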

Resource Limitations:

Problem: "Job exceeded maximum execution time (10 minutes)"
Troubleshooting:
1. Identify slow steps
→ Review execution timeline
2. Check data volume
→ Compare with historical runs
3. Analyze query performance
→ Check execution plans
Solutions:
- Optimize slow queries
- Add database indexes
- Use incremental instead of full refresh
- Increase timeout limit
- Split into multiple smaller jobs

When to Retry Immediately:

Transient errors that may resolve quickly:
- Network timeout
- Database temporarily unavailable
- Rate limit (with backoff)
Configuration:
Retry: Immediately with exponential backoff
Max attempts: 3
Delay: 1min, 2min, 4min
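
This retry pattern is standard exponential backoff: double the delay after each failed attempt and give up after the maximum. A minimal sketch (sync_data and TransientError are placeholders for the failing step and the error types you consider transient):

import time

MAX_ATTEMPTS = 3
BASE_DELAY_SECONDS = 60   # 1 min, doubling on each retry

class TransientError(Exception):
    """Network timeouts, temporary unavailability, rate limits."""

def sync_data() -> None:
    raise TransientError("connection timeout")   # simulated failure for the sketch

def run_with_retries() -> None:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            sync_data()                           # placeholder for the failing step
            return
        except TransientError as exc:             # only retry known-transient errors
            if attempt == MAX_ATTEMPTS:
                raise                             # exhausted; let the failure alert fire
            delay = BASE_DELAY_SECONDS * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

As the "When Not to Retry" guidance below explains, permanent errors such as bad credentials should not be wrapped in this loop.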

When to Wait Before Retry:

Issues that need time to resolve:
- Downstream service maintenance
- Scheduled database backup
- Known outage window
Configuration:
Retry: After maintenance window
Max attempts: 1
Delay: Until 10:00 AM (after backup completes)

When Not to Retry:

Permanent errors that won't resolve automatically:
- Invalid credentials
- Syntax errors in query
- Schema mismatch
- Permission denied
Action:
Send alert immediately
Require manual fix
Don't retry automatically

Fixing and Rerunning:

Steps:
1. Identify root cause from logs
2. Fix the underlying issue
- Update credentials
- Optimize query
- Increase resources
3. Test the fix manually
- Run job manually
- Verify it succeeds
4. Re-enable automation
5. Monitor next few runs closely

Partial Recovery:

If job partially completed:
1. Identify last successful checkpoint
2. Note which data was processed
3. Modify job to resume from checkpoint
4. Run manually to complete
5. Verify data integrity
Example:
Job: Load 100,000 records
Failed at: Record 75,432
Recovery:
- Add WHERE id > 75432 to the query
- Run manually to process remaining records
- Verify total count = 100,000
- Re-enable normal automation
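
The same resume-from-checkpoint idea can be built into the job itself so a rerun automatically skips rows that were already processed. A minimal sketch (the checkpoint file location and the process function are placeholders):

import json
from pathlib import Path

CHECKPOINT_FILE = Path("load_checkpoint.json")   # placeholder location

def read_checkpoint() -> int:
    # Last successfully processed id; 0 means start from the beginning.
    if CHECKPOINT_FILE.exists():
        return json.loads(CHECKPOINT_FILE.read_text())["last_id"]
    return 0

def write_checkpoint(last_id: int) -> None:
    CHECKPOINT_FILE.write_text(json.dumps({"last_id": last_id}))

def process(record: dict) -> None:
    print("loaded", record["id"])                # placeholder for the real load step

def load_all(records: list) -> None:
    last_id = read_checkpoint()
    for record in records:
        if record["id"] <= last_id:
            continue                              # already processed in an earlier run
        process(record)
        write_checkpoint(record["id"])            # advance the checkpoint as we go

load_all([{"id": i} for i in range(1, 6)])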

Create a centralized view of automation health:

Overall Health:

┌─────────────────────────────┐
│ Automation Health           │
│                             │
│ Success Rate (7d): 98.5% ✓  │
│ Active Jobs: 25             │
│ Running Now: 3              │
│ Failed Today: 1             │
└─────────────────────────────┘

Recent Failures:

┌─────────────────────────────┐
│ Recent Failures             │
│                             │
│ • Hourly Sync - 3:00 PM     │
│   Connection timeout        │
│ • Monthly Report - 9:00 AM  │
│   Query error               │
└─────────────────────────────┘

Execution Timeline:

Timeline (Last 24 hours)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓✓✓✓✗✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓
00:00 12:00 24:00
✓ = Success, ✗ = Failed

Performance Trends:

Avg Execution Time (30 days)
3m ██░░░░░░░░░░░░░░░░░░░░
2m ████████████████████░░
1m ████████████████████░░
Week1 Week2 Week3 Week4

Best practices for reliable monitoring:

  1. Set up alerts early: Don’t wait for problems to occur
  2. Monitor trends: Look for gradual degradation, not just failures
  3. Review logs regularly: Weekly review of execution history
  4. Test alert delivery: Ensure notifications reach the right people
  5. Document common issues: Build a troubleshooting playbook
  6. Automate recovery when possible: Use retry logic and fallbacks
  7. Track changes: Log all modifications to automations
  8. Regular audits: Monthly review of all active automations
  9. Clean up old jobs: Remove obsolete automations
  10. Share learnings: Document and share solutions with team

Build reliable, monitored automation that you can trust!