Health Monitoring

Comprehensive health monitoring and diagnostics for your ViewAI deployment.

Overview

ViewAI's health monitoring system helps you:

Check API connectivity and authentication
Monitor service availability and performance
Measure latency and response times
Test network reliability
Run comprehensive diagnostics
Set up automated health checks

HealthChecker Service

The HealthChecker service provides the health checks described on this page. It is attached to every ViewAIClient instance as client.health.

Initialization

example.py

from viewai_client import ViewAIClient

# Initialize client (health checker is included)
client = ViewAIClient(api_key="your-api-key")

# Access health checker
health = client.health

Basic Health Checks

Connection Check

Test basic connectivity to the ViewAI API:

connection_check.py

# Check API connection
result = client.health.check_connection()

if result.healthy:
    print(f"Connected successfully")
    print(f"Response time: {result.response_time_ms:.2f}ms")
    print(f"Status code: {result.status_code}")
    print(f"Details: {result.details}")
else:
    print(f"Connection failed: {result.message}")
    print(f"Error details: {result.details}")

Authentication Check

Verify API key and authentication:

authentication_check.py

# Check authentication
auth_result = client.health.check_authentication()

if auth_result.healthy:
    print("Authentication successful")
    print(f"Response time: {auth_result.response_time_ms:.2f}ms")
    print(f"Workspace count: {auth_result.details.get('workspace_count', 0)}")
else:
    print(f"Authentication failed: {auth_result.message}")
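The connection and authentication checks can be combined into a fail-fast startup check. The following is a minimal sketch, not part of viewai_client: `preflight` and `make_stub_client` are hypothetical names, and the code relies only on the `check_connection()`, `check_authentication()`, `healthy`, and `message` names shown above. The stub client exists only so the example runs without network access.

```python
from types import SimpleNamespace


def preflight(client):
    """Raise RuntimeError if connectivity or authentication is unhealthy.

    Hypothetical helper: it only uses the attributes documented above.
    """
    conn = client.health.check_connection()
    if not conn.healthy:
        raise RuntimeError(f"Connection check failed: {conn.message}")

    auth = client.health.check_authentication()
    if not auth.healthy:
        raise RuntimeError(f"Authentication check failed: {auth.message}")


def make_stub_client(conn_ok=True, auth_ok=True):
    """Build a stand-in client for demonstration (not the real ViewAIClient)."""
    def result(ok, what):
        return SimpleNamespace(healthy=ok, message=f"{what} {'ok' if ok else 'failed'}")

    health = SimpleNamespace(
        check_connection=lambda: result(conn_ok, "connection"),
        check_authentication=lambda: result(auth_ok, "authentication"),
    )
    return SimpleNamespace(health=health)


preflight(make_stub_client())  # both checks healthy: returns without raising

try:
    preflight(make_stub_client(auth_ok=False))
except RuntimeError as exc:
    preflight_error = str(exc)
    print(preflight_error)
```

With a real client, calling `preflight(client)` once at application startup would surface a bad API key before any other request is made.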

HealthCheckResult Object

All health check methods return a HealthCheckResult object with:

health_check_result.py

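from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class HealthCheckResult:
    healthy: bool                    # Overall health status
    status_code: Optional[int]       # HTTP status code (None if no response was received)
    response_time_ms: Optional[float]  # Response time in milliseconds (None if not measured)
    message: str                     # Status message
    details: Dict                    # Additional details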

Example usage:

example_usage.py

result = client.health.check_connection()

print(f"Healthy: {result.healthy}")
print(f"Status: {result.status_code}")
print(f"Response time: {result.response_time_ms}ms")
print(f"Message: {result.message}")
print(f"Details: {result.details}")
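Because the result is shown as a dataclass, it can be converted to a plain dictionary for logging or JSON output. The sketch below assumes the real object is a standard dataclass with exactly the fields listed above; it redefines a local copy so the example is self-contained. `result_to_json` is a hypothetical helper name.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


# Local copy of the dataclass shown above so the example runs on its own.
@dataclass
class HealthCheckResult:
    healthy: bool
    status_code: Optional[int]
    response_time_ms: Optional[float]
    message: str
    details: Dict


def result_to_json(result):
    """Serialize a result to JSON; default=str covers non-JSON detail values."""
    return json.dumps(asdict(result), default=str, sort_keys=True)


# Illustrative values only, not real API output.
example = HealthCheckResult(True, 200, 42.5, "OK", {"workspace_count": 3})
print(result_to_json(example))
```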

Comprehensive Diagnostics

Running Diagnostics

Run a complete diagnostic suite:

run_diagnostics.py

# Run all diagnostics
diagnostics = client.health.run_diagnostics()

# Check results
for check_name, result in diagnostics.items():
    status = "✓" if result.healthy else "✗"
    print(f"{status} {check_name}: {result.message}")

    if result.response_time_ms is not None:
        print(f"  Response time: {result.response_time_ms:.2f}ms")

# Overall status
all_healthy = all(result.healthy for result in diagnostics.values())
print(f"\nOverall status: {'HEALTHY' if all_healthy else 'UNHEALTHY'}")
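When only the failures matter, the diagnostics mapping can be reduced to the unhealthy entries. This is a small sketch under the assumption that `run_diagnostics()` returns a mapping of check name to result, as the loop above implies. `failing_checks` is a hypothetical helper, and the check names in the sample data are illustrative.

```python
from types import SimpleNamespace


def failing_checks(diagnostics):
    """Return {check_name: message} for every unhealthy result."""
    return {
        name: result.message
        for name, result in diagnostics.items()
        if not result.healthy
    }


# Stand-in data shaped like the mapping iterated above (not real API output).
sample = {
    "connection": SimpleNamespace(healthy=True, message="OK"),
    "authentication": SimpleNamespace(healthy=False, message="Invalid API key"),
}
print(failing_checks(sample))
```

An empty result means every check passed, which makes the return value convenient to use directly in an `if` statement.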

Diagnostic Checks

Connection: Basic API connectivity
Authentication: API key validation
Workspaces: Workspace endpoint availability
Projects: Projects endpoint availability

Checking Specific Endpoints

Test individual API endpoints:

check_api_endpoints.py

# Check multiple endpoints
endpoint_results = client.health.check_api_endpoints()

for endpoint_name, result in endpoint_results.items():
    if result.healthy:
        print(f"✓ {endpoint_name}: OK ({result.response_time_ms:.2f}ms)")
    else:
        print(f"✗ {endpoint_name}: {result.message}")

Performance Monitoring

Measuring Latency

Measure API latency with multiple requests:

measure_latency.py

# Measure latency over 10 requests
latency_stats = client.health.measure_latency(num_requests=10)

print("Latency Statistics:")
print(f"  Minimum: {latency_stats['min_ms']:.2f}ms")
print(f"  Maximum: {latency_stats['max_ms']:.2f}ms")
print(f"  Average: {latency_stats['avg_ms']:.2f}ms")
print(f"  Median: {latency_stats['median_ms']:.2f}ms")
print(f"  Requests: {latency_stats['num_requests']}")
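To make the reported fields concrete, the sketch below computes the same kind of summary from a list of raw latency samples. It illustrates the arithmetic only and is not the library's implementation. `summarize_latencies` is a hypothetical name, and `p95_ms` is an extra field that the output above does not include.

```python
import statistics


def summarize_latencies(samples_ms):
    """Compute summary statistics from latency samples given in milliseconds."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), using integer math.
    rank = -(-95 * len(ordered) // 100)
    return {
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "avg_ms": statistics.fmean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],  # extra field, not in the output shown above
        "num_requests": len(ordered),
    }


# Illustrative samples, not measured values.
stats = summarize_latencies([120.0, 80.0, 100.0, 90.0, 110.0])
print(stats)
```

The median is less sensitive to a single slow request than the average, so a large gap between the two usually points to occasional outliers rather than uniformly slow responses.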

Performance Thresholds

Set up alerts based on performance thresholds:

performance_thresholds.py

# Define performance thresholds
LATENCY_WARNING = 500   # ms
LATENCY_CRITICAL = 1000  # ms

latency_stats = client.health.measure_latency(num_requests=5)
avg_latency = latency_stats['avg_ms']

if avg_latency > LATENCY_CRITICAL:
    print(f"CRITICAL: Average latency {avg_latency:.2f}ms exceeds threshold")
elif avg_latency > LATENCY_WARNING:
    print(f"WARNING: Average latency {avg_latency:.2f}ms exceeds warning threshold")
else:
    print(f"OK: Average latency {avg_latency:.2f}ms is within acceptable range")
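The threshold comparison above can be wrapped in a small function so the same rule is reused by every monitor. This is a sketch: `classify_latency` is a hypothetical helper, and its defaults mirror the 500 ms and 1000 ms values used above.

```python
def classify_latency(avg_ms, warning_ms=500, critical_ms=1000):
    """Map an average latency to 'OK', 'WARNING', or 'CRITICAL'."""
    if warning_ms > critical_ms:
        raise ValueError("warning_ms must not exceed critical_ms")
    if avg_ms > critical_ms:
        return "CRITICAL"
    if avg_ms > warning_ms:
        return "WARNING"
    return "OK"


for value in (250.0, 750.0, 1200.0):
    print(f"{value:.0f}ms -> {classify_latency(value)}")
```

The comparisons are strict, so a latency exactly equal to a threshold is reported at the lower level, which matches the `>` comparisons in the example above.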

Network Reliability Testing

Test network reliability with repeated requests:

network_reliability.py

# Test reliability over 20 attempts
reliability = client.health.test_network_reliability(num_attempts=20)

print("Network Reliability:")
print(f"  Total attempts: {reliability['total_attempts']}")
print(f"  Successful: {reliability['successful']}")
print(f"  Failed: {reliability['failed']}")
print(f"  Success rate: {reliability['success_rate_pct']:.1f}%")

if reliability['errors']:
    print("\nRecent errors:")
    for error in reliability['errors']:
        print(f"  - {error}")
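A success rate is most useful when it is compared against a target. The sketch below recomputes the rate from the `successful` and `total_attempts` fields shown above and checks it against a threshold. `success_rate_pct`, `meets_target`, and the 99% default are assumptions chosen for illustration.

```python
def success_rate_pct(successful, total_attempts):
    """Return the success rate as a percentage; 0.0 when nothing was attempted."""
    if total_attempts <= 0:
        return 0.0
    return 100 * successful / total_attempts


def meets_target(reliability, target_pct=99.0):
    """Check a reliability dict (shaped like the output above) against a target."""
    rate = success_rate_pct(reliability["successful"], reliability["total_attempts"])
    return rate >= target_pct


# Stand-in data, not real API output.
sample = {"total_attempts": 20, "successful": 19, "failed": 1}
print(success_rate_pct(sample["successful"], sample["total_attempts"]))
print(meets_target(sample))
```

With only 20 attempts a single failure already lowers the rate to 95%, so small sample sizes call for a lower target or more attempts.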

Automated Monitoring

Simple Monitoring Loop

monitor_health.py

import time
from datetime import datetime

def monitor_health(duration_seconds=300, interval_seconds=30):
    """Monitor health for specified duration."""
    start_time = time.time()
    checks = []

    while time.time() - start_time < duration_seconds:
        # Perform health check
        result = client.health.check_connection()

        check_info = {
            'timestamp': datetime.now(),
            'healthy': result.healthy,
            'response_time': result.response_time_ms,
            'status_code': result.status_code
        }

        checks.append(check_info)

        # Log result (response_time_ms is Optional, so guard against None)
        status = "✓" if result.healthy else "✗"
        rt = check_info['response_time']
        rt_text = f"{rt:.2f}ms" if rt is not None else "n/a"
        print(f"{status} {check_info['timestamp']}: {rt_text}")

        time.sleep(interval_seconds)

    # Calculate statistics (checks is empty if duration_seconds <= 0)
    healthy_count = sum(1 for c in checks if c['healthy'])
    uptime_pct = (healthy_count / len(checks)) * 100 if checks else 0.0

    response_times = [c['response_time'] for c in checks if c['response_time'] is not None]
    avg_response = sum(response_times) / len(response_times) if response_times else 0

    print(f"\nMonitoring Summary:")
    print(f"  Total checks: {len(checks)}")
    print(f"  Healthy checks: {healthy_count}")
    print(f"  Uptime: {uptime_pct:.1f}%")
    print(f"  Average response time: {avg_response:.2f}ms")

    return checks

# Run monitoring
checks = monitor_health(duration_seconds=300, interval_seconds=30)

Advanced Monitoring with Alerts

health_monitor.py

import time
from datetime import datetime

class HealthMonitor:
    """Advanced health monitoring with alerts."""

    def __init__(self, client, alert_callback=None):
        self.client = client
        self.alert_callback = alert_callback or self.default_alert
        self.history = []

    def default_alert(self, alert_type, message):
        """Default alert handler."""
        print(f"ALERT [{alert_type}]: {message}")

    def check_health(self):
        """Perform comprehensive health check."""
        result = self.client.health.check_connection()
        auth_result = self.client.health.check_authentication()

        return {
            'timestamp': datetime.now(),
            'connection': result,
            'authentication': auth_result
        }

    def analyze_check(self, check_result):
        """Analyze health check and trigger alerts."""
        conn = check_result['connection']
        auth = check_result['authentication']

        # Check connection
        if not conn.healthy:
            self.alert_callback('CRITICAL', f'Connection failed: {conn.message}')

        # Check authentication
        if not auth.healthy:
            self.alert_callback('CRITICAL', f'Authentication failed: {auth.message}')

        # Check latency
        if conn.response_time_ms and conn.response_time_ms > 1000:
            self.alert_callback('WARNING', f'High latency: {conn.response_time_ms:.2f}ms')

    def monitor(self, duration_seconds=300, interval_seconds=30):
        """Run monitoring loop."""
        start_time = time.time()

        while time.time() - start_time < duration_seconds:
            check_result = self.check_health()
            self.history.append(check_result)
            self.analyze_check(check_result)

            time.sleep(interval_seconds)

    def get_statistics(self):
        """Calculate monitoring statistics."""
        if not self.history:
            return {}

        connection_checks = [h['connection'] for h in self.history]
        healthy_count = sum(1 for c in connection_checks if c.healthy)

        response_times = [
            c.response_time_ms for c in connection_checks
            if c.response_time_ms
        ]

        return {
            'total_checks': len(self.history),
            'healthy_checks': healthy_count,
            'uptime_pct': (healthy_count / len(self.history)) * 100,
            'avg_response_time': sum(response_times) / len(response_times) if response_times else 0,
            'min_response_time': min(response_times) if response_times else 0,
            'max_response_time': max(response_times) if response_times else 0
        }

# Usage
monitor = HealthMonitor(client)

# Run monitoring
monitor.monitor(duration_seconds=300, interval_seconds=30)

# Get statistics
stats = monitor.get_statistics()
print(f"Uptime: {stats['uptime_pct']:.1f}%")
print(f"Average response time: {stats['avg_response_time']:.2f}ms")
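The monitor above raises an alert on every failed check, which can be noisy during brief network blips. One common way to reduce noise is to alert only after several consecutive failures. The sketch below shows that idea in isolation. `FailureStreak` is a hypothetical class, not part of viewai_client, and the threshold of 3 is an arbitrary example.

```python
class FailureStreak:
    """Signal once when `threshold` consecutive failures are seen; reset on success."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.streak = 0

    def record(self, healthy):
        """Record one check outcome; return True when an alert should fire."""
        if healthy:
            self.streak = 0
            return False
        self.streak += 1
        # Fire exactly once, on the check that reaches the threshold.
        return self.streak == self.threshold


gate = FailureStreak(threshold=3)
outcomes = [False, False, True, False, False, False, False]
fired = [gate.record(healthy) for healthy in outcomes]
print(fired)
```

Inside `analyze_check`, the callback could then be invoked only when `gate.record(conn.healthy)` returns True.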

Health Reporting

Generating Health Reports

generate_health_report.py

from datetime import datetime

def generate_health_report(client):
    """Generate comprehensive health report."""

    report = {
        'timestamp': datetime.now().isoformat(),
        'checks': {}
    }

    # Connection check
    conn_result = client.health.check_connection()
    report['checks']['connection'] = {
        'healthy': conn_result.healthy,
        'response_time_ms': conn_result.response_time_ms,
        'status_code': conn_result.status_code,
        'message': conn_result.message
    }

    # Authentication check
    auth_result = client.health.check_authentication()
    report['checks']['authentication'] = {
        'healthy': auth_result.healthy,
        'response_time_ms': auth_result.response_time_ms,
        'message': auth_result.message
    }

    # Latency measurement
    latency = client.health.measure_latency(num_requests=5)
    report['performance'] = latency

    # Network reliability
    reliability = client.health.test_network_reliability(num_attempts=10)
    report['reliability'] = reliability

    # Overall status
    report['overall_healthy'] = all(
        check['healthy'] for check in report['checks'].values()
    )

    return report

# Generate report
report = generate_health_report(client)

# Display report
print("Health Report")
print("=" * 50)
print(f"Timestamp: {report['timestamp']}")
print(f"Overall Status: {'HEALTHY' if report['overall_healthy'] else 'UNHEALTHY'}")

print("\nChecks:")
for check_name, check_data in report['checks'].items():
    status = "✓" if check_data['healthy'] else "✗"
    print(f"  {status} {check_name}: {check_data['message']}")
    if check_data.get('response_time_ms'):
        print(f"     Response time: {check_data['response_time_ms']:.2f}ms")

print(f"\nPerformance:")
print(f"  Average latency: {report['performance']['avg_ms']:.2f}ms")
print(f"  Min latency: {report['performance']['min_ms']:.2f}ms")
print(f"  Max latency: {report['performance']['max_ms']:.2f}ms")

print(f"\nReliability:")
print(f"  Success rate: {report['reliability']['success_rate_pct']:.1f}%")
print(f"  Successful requests: {report['reliability']['successful']}")
print(f"  Failed requests: {report['reliability']['failed']}")

Exporting Reports

export_health_report.py

import json

def export_health_report(report, filename):
    """Export health report to JSON file."""
    with open(filename, 'w') as f:
        json.dump(report, f, indent=2, default=str)

    print(f"Report exported to {filename}")

# Export report
report = generate_health_report(client)
export_health_report(report, 'health_report.json')
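For historical analysis it can be convenient to append each report to a single JSON Lines file instead of overwriting one JSON document. The following sketch assumes reports are dictionaries like the one built by generate_health_report. `append_report`, `load_reports`, and the file name are hypothetical, and the sample reports are stand-in data.

```python
import json
import os
import tempfile


def append_report(report, path):
    """Append one report to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(report, default=str) + "\n")


def load_reports(path):
    """Read back every report stored in a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demonstration with stand-in reports written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    history_path = os.path.join(tmp, "health_history.jsonl")
    append_report({"timestamp": "2024-01-01T00:00:00", "overall_healthy": True}, history_path)
    append_report({"timestamp": "2024-01-01T00:05:00", "overall_healthy": False}, history_path)
    history = load_reports(history_path)

unhealthy = [r["timestamp"] for r in history if not r["overall_healthy"]]
print(unhealthy)
```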

Integration Examples

Flask Integration

flask_integration.py

from flask import Flask, jsonify
from viewai_client import ViewAIClient

app = Flask(__name__)
client = ViewAIClient(api_key="your-api-key")

@app.route('/health')
def health_check():
    """Health check endpoint."""
    result = client.health.check_connection()

    return jsonify({
        'healthy': result.healthy,
        'response_time_ms': result.response_time_ms,
        'message': result.message
    }), 200 if result.healthy else 503

@app.route('/health/detailed')
def detailed_health():
    """Detailed health check endpoint."""
    diagnostics = client.health.run_diagnostics()

    return jsonify({
        'checks': {
            name: {
                'healthy': result.healthy,
                'message': result.message,
                'response_time_ms': result.response_time_ms
            }
            for name, result in diagnostics.items()
        },
        'overall_healthy': all(r.healthy for r in diagnostics.values())
    })

if __name__ == '__main__':
    app.run(port=5000)

FastAPI Integration

fastapi_integration.py

from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from viewai_client import ViewAIClient

app = FastAPI()
client = ViewAIClient(api_key="your-api-key")

class HealthResponse(BaseModel):
    healthy: bool
    response_time_ms: Optional[float] = None  # HealthCheckResult.response_time_ms is Optional
    message: str

@app.get("/health", response_model=HealthResponse)
def health_check():
    """Health check endpoint."""
    # Plain def (not async def): the client call blocks, so FastAPI runs it in a worker thread.
    result = client.health.check_connection()

    if not result.healthy:
        raise HTTPException(status_code=503, detail=result.message)

    return {
        'healthy': result.healthy,
        'response_time_ms': result.response_time_ms,
        'message': result.message
    }

@app.get("/health/diagnostics")
def diagnostics():
    """Comprehensive diagnostics endpoint."""
    # Plain def (not async def): the client call blocks, so FastAPI runs it in a worker thread.
    results = client.health.run_diagnostics()

    return {
        'checks': {
            name: {
                'healthy': result.healthy,
                'message': result.message,
                'response_time_ms': result.response_time_ms
            }
            for name, result in results.items()
        },
        'overall_healthy': all(r.healthy for r in results.values())
    }

Scheduled Monitoring

scheduled_monitoring.py

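import time
from datetime import datetime

from apscheduler.schedulers.background import BackgroundScheduler
from viewai_client import ViewAIClient

client = ViewAIClient(api_key="your-api-key")

def send_alert(message):
    """Placeholder alert hook; replace with email, Slack, etc."""
    print(f"ALERT: {message}")

def scheduled_health_check():
    """Scheduled health check function."""
    result = client.health.check_connection()

    print(f"[{datetime.now()}] Health check: {'OK' if result.healthy else 'FAILED'}")

    if not result.healthy:
        send_alert(f"Health check failed: {result.message}")

# Set up scheduler
scheduler = BackgroundScheduler()
scheduler.add_job(
    scheduled_health_check,
    'interval',
    minutes=5,  # Run every 5 minutes
    id='health_check'
)

scheduler.start()

# Jobs run in a background thread, so keep the main thread alive
try:
    while True:
        time.sleep(1)
except (KeyboardInterrupt, SystemExit):
    scheduler.shutdown()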
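A check that runs every few minutes will repeat the same alert for as long as an outage lasts. A cooldown between alerts is one way to limit that. The sketch below is independent of APScheduler and of viewai_client. `ThrottledAlerter`, the 900-second cooldown, and the injectable clock are all assumptions made for illustration.

```python
import time


class ThrottledAlerter:
    """Forward alerts to `send`, but at most once per cooldown period."""

    def __init__(self, send, cooldown_seconds=900, clock=time.monotonic):
        self.send = send
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self._last_sent = None

    def alert(self, message):
        """Send the alert unless one was sent recently; return True if sent."""
        now = self.clock()
        if self._last_sent is not None and now - self._last_sent < self.cooldown_seconds:
            return False
        self.send(message)
        self._last_sent = now
        return True


# Demonstration with a fake clock and a list standing in for the real channel.
sent = []
fake_now = [0.0]
alerter = ThrottledAlerter(sent.append, cooldown_seconds=900, clock=lambda: fake_now[0])

first = alerter.alert("Health check failed")   # sent
fake_now[0] = 300.0
second = alerter.alert("Health check failed")  # suppressed, still inside the cooldown
fake_now[0] = 1000.0
third = alerter.alert("Health check failed")   # sent again after the cooldown
print(first, second, third, len(sent))
```

In the scheduled job, `alerter.alert(...)` would replace the direct call to the alert function.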

Best Practices

Regular Monitoring: Check health at regular intervals (every 5-10 minutes)
Multiple Metrics: Monitor connection, authentication, and performance
Alerting: Set up alerts for failures and performance degradation
Thresholds: Define clear thresholds for warnings and critical alerts
Logging: Log all health check results for historical analysis
Response Times: Monitor response times to detect performance issues early
Network Reliability: Track success rates to identify intermittent issues
Comprehensive Checks: Use run_diagnostics() for thorough testing
Automated Recovery: Implement retry logic for transient failures
Documentation: Document health check procedures and thresholds
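The Automated Recovery practice above can be sketched as a retry with exponential backoff. This is a generic pattern rather than a viewai_client feature. `retry_check` is a hypothetical helper that accepts any zero-argument callable returning an object with a `healthy` attribute, and the attempt count and delays are arbitrary examples.

```python
import time
from types import SimpleNamespace


def retry_check(check, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `check` until it reports healthy or the attempts run out.

    Delays double after each failed attempt: base_delay, 2 * base_delay, ...
    Returns the last result, healthy or not.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(attempts):
        result = check()
        if result.healthy:
            break
        if attempt < attempts - 1:  # no need to wait after the final attempt
            sleep(base_delay * 2 ** attempt)
    return result


# Demonstration: a fake check that fails twice, then succeeds.
outcomes = iter([False, False, True])
delays = []  # records requested delays instead of actually sleeping


def flaky_check():
    return SimpleNamespace(healthy=next(outcomes))


final = retry_check(flaky_check, attempts=5, base_delay=0.5, sleep=delays.append)
print(final.healthy, delays)
```

With a real client this would be called as `retry_check(client.health.check_connection)`, so that a single transient failure does not immediately raise an alert.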

Monitoring Strategies

Development Environment

dev_monitoring.py

# Frequent checks with verbose output
def dev_monitoring():
    """Development environment monitoring."""
    diagnostics = client.health.run_diagnostics()

    for check_name, result in diagnostics.items():
        print(f"{check_name}: {result.message}")
        if result.details:
            print(f"  Details: {result.details}")

Staging Environment

staging_monitoring.py

# Regular checks with alerting.
# send_alert() and log_health_status() are placeholders for your own helpers.
def staging_monitoring():
    """Staging environment monitoring."""
    result = client.health.check_connection()
    auth_result = client.health.check_authentication()

    if not result.healthy or not auth_result.healthy:
        send_alert("Staging health check failed")

    log_health_status(result, auth_result)

Production Environment

production_monitoring.py

# Comprehensive monitoring with metrics.
# send_metrics() and trigger_incident() are placeholders for your own helpers.
def production_monitoring():
    """Production environment monitoring."""
    # Run comprehensive diagnostics
    diagnostics = client.health.run_diagnostics()

    # Measure performance
    latency = client.health.measure_latency(num_requests=10)

    # Test reliability
    reliability = client.health.test_network_reliability(num_attempts=20)

    # Send metrics to monitoring system
    send_metrics({
        'diagnostics': diagnostics,
        'latency': latency,
        'reliability': reliability
    })

    # Alert on failures
    if not all(r.healthy for r in diagnostics.values()):
        trigger_incident("ViewAI health check failed")

Troubleshooting

Connection Failures

Check connection status

result = client.health.check_connection()

if not result.healthy:
    print("Troubleshooting connection failure:")
    print(f"1. Check API endpoint: {client.http_client.base_url}")
    print(f"2. Verify network connectivity")
    print(f"3. Check firewall rules")
    print(f"4. Review error: {result.message}")

Authentication Failures

Verify authentication

auth_result = client.health.check_authentication()

if not auth_result.healthy:
    print("Troubleshooting authentication failure:")
    print(f"1. Verify API key is correct")
    print(f"2. Check API key expiration")
    print(f"3. Confirm API key permissions")
    print(f"4. Review error: {auth_result.message}")

High Latency

Diagnose high latency

latency = client.health.measure_latency(num_requests=10)

if latency['avg_ms'] > 500:
    print("High latency detected:")
    print(f"Average: {latency['avg_ms']:.2f}ms")
    print(f"Max: {latency['max_ms']:.2f}ms")
    print("Recommendations:")
    print("1. Check network connection")
    print("2. Verify server load")
    print("3. Consider geographic proximity")
    print("4. Review timeout settings")

Next Steps

Learn about Model Registry
Explore Version Management
Review Best Practices
