Monitoring & Usage

This guide covers monitoring Querri’s health, tracking usage metrics, managing resources, and analyzing performance.
Usage Tracking Interface
Accessing Usage Metrics
The usage dashboard is available at:
```
https://app.yourcompany.com/settings/usage
```

Requirements: Admin access
Usage Dashboard
The usage page displays:
- User activity - Active users, login frequency
- Resource usage - Projects, dashboards, files
- Storage metrics - Total storage, per-user breakdown
- API usage - API calls, rate limits
- AI usage - Token consumption, model usage
- Integration activity - Sync frequency, data volume
Service Health Monitoring
Health Check Endpoints
Querri provides multiple health check endpoints:
External Health Check
```shell
# Healthz service (external monitoring)
curl http://localhost:8180/healthz
```

Response:
{ "status": "healthy", "timestamp": "2024-01-15T10:00:00Z", "services": { "web-app": "up", "server-api": "up", "hub": "up", "mongo": "up", "redis": "up" }}Hub Service Health
```shell
# Hub internal health check
curl http://localhost:8888/hub/healthz
```

API Health
```shell
# Server API health
curl http://localhost:8888/api/health
```

Service Status via Docker
```shell
# View all service status
docker compose ps

# Check specific service health
docker inspect --format='{{.State.Health.Status}}' querri-hub
```
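The Health.Status field is populated only for services that define a healthcheck. As a sketch, a compose-level healthcheck looks like this (Querri's shipped compose file defines its own; the interval, endpoint, and service name here are assumptions):

```yaml
services:
  hub:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8888/hub/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
```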
```shell
# View service logs
docker compose logs --tail=100 -f server-api
```

Automated Health Monitoring
Set up automated health checks with monitoring tools:
Uptime Monitoring (Uptime Robot, Pingdom)
Configure an HTTP monitor:
```
URL: https://app.yourcompany.com/api/health
Interval: 5 minutes
Expected: Status 200
Alert on: Status != 200 or response time > 5s
```

Nagios/Icinga Check
```shell
#!/bin/bash
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8180/healthz)

if [ "$RESPONSE" -eq 200 ]; then
  echo "OK - Querri is healthy"
  exit 0
else
  echo "CRITICAL - Querri health check failed (HTTP $RESPONSE)"
  exit 2
fi
```

Resource Monitoring
MongoDB Monitoring
Database Metrics
```shell
# Connect to MongoDB
docker compose exec mongo mongosh -u querri -p
```

```javascript
// Database statistics
use querri
db.stats()

// Collection sizes
db.projects.stats()
db.files.stats()
db.users.stats()

// Current operations
db.currentOp()
```

Storage Usage
```javascript
// Total database size
db.stats().dataSize

// Storage by collection
db.getCollectionNames().forEach(function(collection) {
  var stats = db[collection].stats();
  print(collection + ": " + (stats.size / 1024 / 1024).toFixed(2) + " MB");
});

// File storage usage
db.files.aggregate([
  {
    $group: {
      _id: null,
      total_size_mb: { $sum: { $divide: ["$size", 1024 * 1024] } },
      file_count: { $sum: 1 }
    }
  }
])
```

Index Performance
```javascript
// List indexes
db.projects.getIndexes()

// Index statistics
db.projects.aggregate([{ $indexStats: {} }])
```
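If $indexStats reveals a frequent query pattern with no supporting index, one can be created in mongosh; a sketch (the field names are illustrative assumptions, not necessarily Querri's actual schema):

```javascript
// Example: support lookups by owner, sorted by creation date
db.projects.createIndex({ user_email: 1, created_at: -1 })
```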
```javascript
// Slow queries (enable profiling first)
db.setProfilingLevel(1, { slowms: 100 })  // Log queries > 100ms
db.system.profile.find().sort({ ts: -1 }).limit(10)
```

Redis Monitoring
```shell
# Connect to Redis
docker compose exec redis redis-cli
```

```
# Server information
INFO

# Memory usage
INFO memory

# Key statistics
INFO keyspace

# List all keys (use with caution in production)
KEYS *
```
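KEYS blocks the server while it walks the entire keyspace; on a busy production instance, SCAN is the safer way to enumerate keys incrementally (the session:* pattern is illustrative):

```
# Iterate keys without blocking; repeat with the returned cursor until it is 0
SCAN 0 MATCH session:* COUNT 100
```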
```
# Monitor real-time commands
MONITOR
```

Redis Memory Analysis
```shell
# Memory stats
redis-cli INFO memory | grep used_memory_human

# Largest keys
redis-cli --bigkeys

# Key expiration info
redis-cli TTL session:user@company.com
```

Container Resource Usage
```shell
# Docker container stats
docker stats --no-stream

# Specific service resources
docker stats querri-server-api --no-stream

# Detailed container metrics
docker inspect querri-server-api | grep -A 20 "State"
```

Disk Space Monitoring
```shell
# MongoDB volume usage
docker volume inspect querri_mongodb_data

# Docker system disk usage
docker system df -v

# Container disk usage
du -sh /var/lib/docker/volumes/querri_mongodb_data
```

Log Aggregation
Service Logs
Viewing Logs
```shell
# All services
docker compose logs

# Specific service
docker compose logs server-api

# Follow logs in real-time
docker compose logs -f --tail=100 server-api

# Filter by time
docker compose logs --since 1h server-api

# Multiple services
docker compose logs web-app server-api hub
```

Log Levels
Configure log verbosity in .env-prod:
```shell
# Application log level
LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR, CRITICAL
```

For Traefik:
```yaml
log:
  level: INFO  # DEBUG, INFO, WARN, ERROR
accessLog:
  enabled: true
```

Centralized Logging
Option 1: ELK Stack (Elasticsearch, Logstash, Kibana)
docker-compose.logging.yml:
```yaml
elasticsearch:
  image: elasticsearch:8.11.0
  environment:
    - discovery.type=single-node
  ports:
    - "9200:9200"

logstash:
  image: logstash:8.11.0
  volumes:
    - ./logstash/pipeline:/usr/share/logstash/pipeline
  depends_on:
    - elasticsearch

kibana:
  image: kibana:8.11.0
  ports:
    - "5601:5601"
  depends_on:
    - elasticsearch
```

logstash/pipeline/docker.conf:
```
input {
  gelf {
    port => 12201
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "querri-logs-%{+YYYY.MM.dd}"
  }
}
```

Update the main docker-compose.yml:
```yaml
services:
  server-api:
    logging:
      driver: gelf
      options:
        gelf-address: "udp://localhost:12201"
        tag: "server-api"
```

Option 2: Loki (Lightweight)
```yaml
loki:
  image: grafana/loki:2.9.0
  ports:
    - "3100:3100"

promtail:
  image: grafana/promtail:2.9.0
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - ./promtail-config.yml:/etc/promtail/config.yml
```

Application Logging
Structured Logging
Python services use structured JSON logging:
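One way records like the log call below end up as one JSON object per line is a formatter along these lines (a hand-rolled sketch; the actual formatter inside Querri may differ):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Context keys passed via extra={} that we want to surface
    CONTEXT_KEYS = ("user_email", "project_id", "organization_id")

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy any known context fields attached to the record
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```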
```python
import logging
import json

logger = logging.getLogger(__name__)

logger.info(
    "Project created",
    extra={
        "user_email": "user@company.com",
        "project_id": "abc123",
        "organization_id": "org_123"
    }
)
```

Log Retention
Configure log rotation:
Docker log rotation (in /etc/docker/daemon.json):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

```shell
# Restart Docker daemon
sudo systemctl restart docker
```

Performance Metrics
Application Performance Monitoring (APM)
Sentry Integration
Configure Sentry for error tracking:
```shell
PUBLIC_SENTRY_ORG_ID="your_org_id"
PUBLIC_SENTRY_PROJECT_ID="your_project_id"
PUBLIC_SENTRY_KEY="your_sentry_key"
SENTRY_AUTH_TOKEN="your_auth_token"
```

View errors in the Sentry dashboard:
- Error frequency and trends
- Stack traces
- User impact
- Performance issues
Custom Metrics
Track custom application metrics:
```python
from datetime import datetime

# Track AI usage (db is a pymongo Database handle)
db.metrics.insert_one({
    "metric": "ai_tokens_used",
    "value": 1247,
    "user_email": "user@company.com",
    "model": "gpt-4o",
    "timestamp": datetime.now()
})

# Track API latency
db.metrics.insert_one({
    "metric": "api_latency_ms",
    "value": 234,
    "endpoint": "/api/projects",
    "method": "GET",
    "timestamp": datetime.now()
})
```

Database Performance
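One housekeeping note: a metrics collection like the one above grows without bound. A TTL index caps retention automatically (a suggestion, not a Querri default; the 30-day window is an assumption):

```javascript
// mongosh: expire metrics documents 30 days after their timestamp
db.metrics.createIndex({ timestamp: 1 }, { expireAfterSeconds: 30 * 24 * 60 * 60 })
```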
Query Performance
Section titled “Query Performance”// Enable profilingdb.setProfilingLevel(2) // Log all queries
// View slow queriesdb.system.profile.find({millis: {$gt: 100}}).sort({ts: -1}).limit(10)
// Explain query performancedb.projects.find({user_email: "user@company.com"}).explain("executionStats")Connection Pool Monitoring
Section titled “Connection Pool Monitoring”# MongoDB connection pool statsfrom pymongo import MongoClient
client = MongoClient(connection_string)pool_stats = client.server_info()print(f"Connections: {pool_stats['connections']}")API Performance
Response Time Tracking
Section titled “Response Time Tracking”// Average API response time by endpointdb.api_logs.aggregate([ { $match: { timestamp: { $gte: new Date(Date.now() - 24*60*60*1000) } } }, { $group: { _id: "$endpoint", avg_response_ms: {$avg: "$response_time_ms"}, count: {$sum: 1} } }, { $sort: {avg_response_ms: -1} }])Rate Limit Monitoring
Section titled “Rate Limit Monitoring”// API calls per user (last 24 hours)db.api_logs.aggregate([ { $match: { timestamp: { $gte: new Date(Date.now() - 24*60*60*1000) } } }, { $group: { _id: "$user_email", api_calls: {$sum: 1} } }, { $sort: {api_calls: -1} }])Usage Analytics
User Activity
Active Users
Section titled “Active Users”// Active users (last 30 days)db.users.find({ last_login: { $gte: new Date(Date.now() - 30*24*60*60*1000) }}).count()
// Daily active usersdb.audit_log.aggregate([ { $match: { event_type: "authentication", timestamp: { $gte: new Date(Date.now() - 30*24*60*60*1000) } } }, { $group: { _id: { $dateToString: { format: "%Y-%m-%d", date: "$timestamp" } }, unique_users: {$addToSet: "$user_email"} } }, { $project: { date: "$_id", user_count: {$size: "$unique_users"} } }, { $sort: {date: -1} }])Feature Usage
```javascript
// Most used features
db.audit_log.aggregate([
  {
    $match: {
      timestamp: { $gte: new Date(Date.now() - 30*24*60*60*1000) }
    }
  },
  {
    $group: {
      _id: "$action",
      count: { $sum: 1 }
    }
  },
  { $sort: { count: -1 } }
])
```

Resource Usage
Projects by User
```javascript
// Projects per user
db.projects.aggregate([
  {
    $group: {
      _id: "$created_by",
      project_count: { $sum: 1 }
    }
  },
  { $sort: { project_count: -1 } }
])
```

Storage by User
```javascript
// Storage usage per user
db.files.aggregate([
  {
    $group: {
      _id: "$uploaded_by",
      total_bytes: { $sum: "$size" },
      file_count: { $sum: 1 }
    }
  },
  {
    $project: {
      user_email: "$_id",
      total_mb: { $divide: ["$total_bytes", 1024*1024] },
      file_count: 1
    }
  },
  { $sort: { total_mb: -1 } }
])
```

AI Usage Tracking
```javascript
// AI token usage by user
db.ai_usage.aggregate([
  {
    $match: {
      timestamp: { $gte: new Date(Date.now() - 30*24*60*60*1000) }
    }
  },
  {
    $group: {
      _id: "$user_email",
      total_tokens: { $sum: "$tokens" },
      requests: { $sum: 1 },
      cost_usd: { $sum: "$cost_usd" }
    }
  },
  { $sort: { total_tokens: -1 } }
])
```

Alerting
Alert Configuration
Set up alerts for critical conditions:
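Check scripts like the ones in this section are typically driven by cron; a crontab sketch (the script paths are assumptions):

```
# Disk-space check every 5 minutes, health check every minute
*/5 * * * * /opt/querri/monitoring/disk_alert.sh
* * * * * /opt/querri/monitoring/health_alert.sh
```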
Disk Space Alert
Section titled “Disk Space Alert”#!/bin/bashTHRESHOLD=80USAGE=$(df -h /var/lib/docker | awk 'NR==2 {print $5}' | sed 's/%//')
if [ $USAGE -gt $THRESHOLD ]; then echo "CRITICAL: Disk usage is ${USAGE}%" # Send alert (email, Slack, PagerDuty, etc.) curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \ -H 'Content-Type: application/json' \ -d "{\"text\": \"🚨 Querri disk usage: ${USAGE}%\"}"fiService Down Alert
Section titled “Service Down Alert”#!/bin/bashHEALTH=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8180/healthz)
if [ "$HEALTH" -ne 200 ]; then echo "CRITICAL: Health check failed (HTTP $HEALTH)" # Send alert curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \ -H 'Content-Type: application/json' \ -d '{"text": "🚨 Querri health check failed"}'fiMongoDB Replication Lag
Section titled “MongoDB Replication Lag”// Check replication lag (if using replica set)rs.printSlaveReplicationInfo()
// Alert if lag > 10 secondsconst lag = rs.status().members[1].optimeDate - rs.status().members[0].optimeDate;if (lag > 10000) { // milliseconds // Send alert}Alert Channels
Email Alerts
Section titled “Email Alerts”# Send email alertfrom api.utils.email import send_email
send_email( to="admin@company.com", subject="Querri Alert: High Disk Usage", body="Disk usage has exceeded 80%")Slack Alerts
```shell
# Slack webhook
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-Type: application/json' \
  -d '{"text": "🚨 Querri Alert: Service degraded"}'
```

PagerDuty Integration
Section titled “PagerDuty Integration”import requests
def send_pagerduty_alert(message): payload = { "routing_key": "YOUR_INTEGRATION_KEY", "event_action": "trigger", "payload": { "summary": message, "severity": "error", "source": "querri-monitoring" } } requests.post( "https://events.pagerduty.com/v2/enqueue", json=payload )Dashboards
Grafana Integration
Set up Grafana for visual monitoring:
```yaml
grafana:
  image: grafana/grafana:10.0.0
  ports:
    - "3000:3000"
  volumes:
    - grafana-storage:/var/lib/grafana
    - ./grafana/dashboards:/etc/grafana/provisioning/dashboards

prometheus:
  image: prom/prometheus:v2.45.0
  ports:
    - "9090:9090"
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    - prometheus-storage:/prometheus
```

Metrics Export
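The compose fragment above mounts ./prometheus/prometheus.yml but does not show the file itself; a minimal scrape config sketch (the job name, target, and port are assumptions):

```yaml
scrape_configs:
  - job_name: "querri-server-api"
    scrape_interval: 15s
    static_configs:
      - targets: ["server-api:8000"]  # assumed service name and metrics port
```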
Export metrics to Prometheus:
```python
# Prometheus metrics endpoint (assuming FastAPI, which matches the @app.get decorator)
from fastapi import FastAPI, Response
from prometheus_client import Counter, Histogram, generate_latest

app = FastAPI()

api_requests = Counter('api_requests_total', 'Total API requests')
api_latency = Histogram('api_latency_seconds', 'API latency')

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain")
```

Troubleshooting Performance Issues
High CPU Usage
Section titled “High CPU Usage”-
Identify process:
Terminal window docker stats --no-stream -
Check service logs:
Terminal window docker compose logs server-api | grep -i error -
Review query performance:
db.currentOp({"secs_running": {$gte: 5}}) -
Scale service:
Terminal window SERVER_API_REPLICAS=8 docker compose up -d --scale server-api=8
High Memory Usage
1. Check container memory:

```shell
docker stats --format "table {{.Name}}\t{{.MemUsage}}"
```

2. Review MongoDB memory:

```javascript
db.serverStatus().mem
```

3. Clear the Redis cache:

```shell
# Warning: FLUSHALL removes ALL keys, including active sessions
docker compose exec redis redis-cli FLUSHALL
```
Slow Response Times
1. Check API latency:

```javascript
db.api_logs.find().sort({ response_time_ms: -1 }).limit(10)
```

2. Review database indexes:

```javascript
db.projects.getIndexes()
```

3. Analyze slow queries:

```javascript
db.system.profile.find({ millis: { $gt: 100 } })
```
Best Practices
- Set up automated monitoring - Don’t rely on manual checks
- Establish baselines - Know normal performance metrics
- Alert on trends - Catch issues before they become critical
- Regular reviews - Weekly review of metrics and logs
- Document incidents - Keep record of issues and resolutions
- Capacity planning - Monitor growth and plan for scaling
- Test alerts - Ensure alert system is working
Next Steps
- Backup & Maintenance - Backup and recovery procedures
- Troubleshooting - Common issues and solutions
- Security & Permissions - Monitor security events
- AI Tuning - Optimize AI performance and costs