
# Health Monitoring

Comprehensive health monitoring and diagnostics for your ViewAI deployment.

## Overview

ViewAI's health monitoring system helps you:

* Check API connectivity and authentication
* Monitor service availability and performance
* Measure latency and response times
* Test network reliability
* Run comprehensive diagnostics
* Set up automated health checks

## HealthChecker Service

The `HealthChecker` service is included with every `ViewAIClient` and is available as `client.health`. Every check on this page is called through it.

### Initialization

{% code title="example.py" %}

```python
from viewai_client import ViewAIClient

# Initialize client (health checker is included)
client = ViewAIClient(api_key="your-api-key")

# Access health checker
health = client.health
```

{% endcode %}

## Basic Health Checks

### Connection Check

Test basic connectivity to the ViewAI API:

{% code title="connection\_check.py" %}

```python
# Check API connection
result = client.health.check_connection()

if result.healthy:
    print("Connected successfully")
    print(f"Response time: {result.response_time_ms:.2f}ms")
    print(f"Status code: {result.status_code}")
    print(f"Details: {result.details}")
else:
    print(f"Connection failed: {result.message}")
    print(f"Error details: {result.details}")
```

{% endcode %}

### Authentication Check

Verify API key and authentication:

{% code title="authentication\_check.py" %}

```python
# Check authentication
auth_result = client.health.check_authentication()

if auth_result.healthy:
    print("Authentication successful")
    print(f"Response time: {auth_result.response_time_ms:.2f}ms")
    print(f"Workspace count: {auth_result.details.get('workspace_count', 0)}")
else:
    print(f"Authentication failed: {auth_result.message}")
```

{% endcode %}

## HealthCheckResult Object

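Individual checks such as `check_connection()` and `check_authentication()` return a `HealthCheckResult`; `run_diagnostics()` and `check_api_endpoints()` return a dict that maps each check name to one. The object has these fields: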

{% code title="health\_check\_result.py" %}

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class HealthCheckResult:
    healthy: bool                      # Overall health status
    status_code: Optional[int]         # HTTP status code
    response_time_ms: Optional[float]  # Response time in milliseconds
    message: str                       # Status message
    details: Dict                      # Additional details

{% endcode %}

Example usage:

{% code title="example\_usage.py" %}

```python
result = client.health.check_connection()

print(f"Healthy: {result.healthy}")
print(f"Status: {result.status_code}")
print(f"Response time: {result.response_time_ms}ms")
print(f"Message: {result.message}")
print(f"Details: {result.details}")
```

{% endcode %}
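
Because `HealthCheckResult` is shown above as a dataclass, one way to log or transmit a result is to convert it with `dataclasses.asdict`. The following is a minimal sketch under that assumption. It uses a local stand-in class with the same fields so it runs without the SDK, and `result_to_json` is a hypothetical helper, not part of `viewai_client`.

{% code title="serialize\_result.py" %}

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class HealthCheckResult:
    """Local stand-in mirroring the documented fields (not the SDK class)."""
    healthy: bool
    status_code: Optional[int] = None
    response_time_ms: Optional[float] = None
    message: str = ""
    details: Dict = field(default_factory=dict)


def result_to_json(result) -> str:
    """Serialize a dataclass-based result to JSON (hypothetical helper)."""
    # default=str guards against non-JSON values (e.g. datetimes) in details.
    return json.dumps(asdict(result), sort_keys=True, default=str)


sample = HealthCheckResult(healthy=True, status_code=200,
                           response_time_ms=12.5, message="OK")
print(result_to_json(sample))
```

{% endcode %}

If the real class turns out not to be a dataclass, reading the five attributes explicitly into a dict achieves the same result.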

## Comprehensive Diagnostics

### Running Diagnostics

Run a complete diagnostic suite:

{% code title="run\_diagnostics.py" %}

```python
# Run all diagnostics
diagnostics = client.health.run_diagnostics()

# Check results
for check_name, result in diagnostics.items():
    status = "✓" if result.healthy else "✗"
    print(f"{status} {check_name}: {result.message}")

    if result.response_time_ms is not None:
        print(f"  Response time: {result.response_time_ms:.2f}ms")

# Overall status
all_healthy = all(result.healthy for result in diagnostics.values())
print(f"\nOverall status: {'HEALTHY' if all_healthy else 'UNHEALTHY'}")
```

{% endcode %}

### Diagnostic Checks

{% stepper %}
{% step %}

### Connection

Basic API connectivity
{% endstep %}

{% step %}

### Authentication

API key validation
{% endstep %}

{% step %}

### Workspaces

Workspace endpoint availability
{% endstep %}

{% step %}

### Projects

Projects endpoint availability
{% endstep %}
{% endstepper %}
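
For scripts that gate a deployment or a cron job on these checks, it can help to reduce the diagnostics dict to a list of failing check names and an exit code. The sketch below assumes only that each value has a `healthy` attribute. The key names in `example` are guesses based on the check names listed above, and `failed_checks` is a hypothetical helper.

{% code title="failed\_checks.py" %}

```python
from types import SimpleNamespace


def failed_checks(diagnostics):
    """Return the names of unhealthy checks, sorted for stable output."""
    return sorted(
        name for name, result in diagnostics.items() if not result.healthy
    )


# Stand-in results; in real use this would be client.health.run_diagnostics().
# The key names here are assumptions, not confirmed SDK output.
example = {
    "connection": SimpleNamespace(healthy=True, message="OK"),
    "authentication": SimpleNamespace(healthy=True, message="OK"),
    "workspaces": SimpleNamespace(healthy=False, message="HTTP 500"),
    "projects": SimpleNamespace(healthy=True, message="OK"),
}

failures = failed_checks(example)
exit_code = 1 if failures else 0  # non-zero lets a CI job or cron wrapper fail
print(f"failed={failures} exit_code={exit_code}")
```

{% endcode %}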

### Checking Specific Endpoints

Test individual API endpoints:

{% code title="check\_api\_endpoints.py" %}

```python
# Check multiple endpoints
endpoint_results = client.health.check_api_endpoints()

for endpoint_name, result in endpoint_results.items():
    if result.healthy:
        print(f"✓ {endpoint_name}: OK ({result.response_time_ms:.2f}ms)")
    else:
        print(f"✗ {endpoint_name}: {result.message}")
```

{% endcode %}

## Performance Monitoring

### Measuring Latency

Measure API latency with multiple requests:

{% code title="measure\_latency.py" %}

```python
# Measure latency over 10 requests
latency_stats = client.health.measure_latency(num_requests=10)

print("Latency Statistics:")
print(f"  Minimum: {latency_stats['min_ms']:.2f}ms")
print(f"  Maximum: {latency_stats['max_ms']:.2f}ms")
print(f"  Average: {latency_stats['avg_ms']:.2f}ms")
print(f"  Median: {latency_stats['median_ms']:.2f}ms")
print(f"  Requests: {latency_stats['num_requests']}")
```

{% endcode %}
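
To make the meaning of these statistics concrete, the sketch below computes the same five keys from a list of raw samples using the standard library. It illustrates the arithmetic only and is not the SDK's implementation; `summarize_latency` is a hypothetical name.

{% code title="summarize\_latency.py" %}

```python
import statistics


def summarize_latency(samples_ms):
    """Compute summary statistics from latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    return {
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
        "avg_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "num_requests": len(samples_ms),
    }


# One slow request (400 ms) pulls the average up but barely moves the median.
stats = summarize_latency([120.0, 95.0, 110.0, 400.0, 105.0])
print(stats)
```

{% endcode %}

When a single outlier is present, as in this sample, the median is usually the better indicator of typical latency, while the maximum shows the worst case.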

### Performance Thresholds

Set up alerts based on performance thresholds:

{% code title="performance\_thresholds.py" %}

```python
# Define performance thresholds
LATENCY_WARNING = 500   # ms
LATENCY_CRITICAL = 1000  # ms

latency_stats = client.health.measure_latency(num_requests=5)
avg_latency = latency_stats['avg_ms']

if avg_latency > LATENCY_CRITICAL:
    print(f"CRITICAL: Average latency {avg_latency:.2f}ms exceeds threshold")
elif avg_latency > LATENCY_WARNING:
    print(f"WARNING: Average latency {avg_latency:.2f}ms exceeds warning threshold")
else:
    print(f"OK: Average latency {avg_latency:.2f}ms is within acceptable range")
```

{% endcode %}
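
The same threshold logic can be wrapped in a small function so that the severity can be reused by alerting or reporting code. This is a sketch; `classify_latency` is a hypothetical helper, and the defaults simply repeat the example thresholds above.

{% code title="classify\_latency.py" %}

```python
def classify_latency(avg_ms, warning_ms=500, critical_ms=1000):
    """Map an average latency to 'OK', 'WARNING', or 'CRITICAL'."""
    if warning_ms > critical_ms:
        raise ValueError("warning_ms must not exceed critical_ms")
    # Comparisons are strict, so a value equal to a threshold stays
    # in the lower severity band.
    if avg_ms > critical_ms:
        return "CRITICAL"
    if avg_ms > warning_ms:
        return "WARNING"
    return "OK"


for value in (120.0, 750.0, 1200.0):
    print(f"{value:.0f}ms -> {classify_latency(value)}")
```

{% endcode %}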

### Network Reliability Testing

Test network reliability with repeated requests:

{% code title="network\_reliability.py" %}

```python
# Test reliability over 20 attempts
reliability = client.health.test_network_reliability(num_attempts=20)

print("Network Reliability:")
print(f"  Total attempts: {reliability['total_attempts']}")
print(f"  Successful: {reliability['successful']}")
print(f"  Failed: {reliability['failed']}")
print(f"  Success rate: {reliability['success_rate_pct']:.1f}%")

if reliability['errors']:
    print("\nRecent errors:")
    for error in reliability['errors']:
        print(f"  - {error}")
```

{% endcode %}
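
As an illustration of how the fields in the reliability dict relate to each other, the sketch below derives them from a list of attempt outcomes. It is not the SDK's implementation: `summarize_attempts` and the convention that `None` means success are assumptions made for this example.

{% code title="summarize\_attempts.py" %}

```python
def summarize_attempts(outcomes, max_errors=5):
    """Summarize attempts; None means success, a string is an error message."""
    total = len(outcomes)
    errors = [o for o in outcomes if o is not None]
    successful = total - len(errors)
    return {
        "total_attempts": total,
        "successful": successful,
        "failed": len(errors),
        # Multiply before dividing to avoid avoidable float rounding.
        "success_rate_pct": (successful * 100 / total) if total else 0.0,
        "errors": errors[-max_errors:],  # keep only the most recent errors
    }


outcomes = [None] * 18 + ["timeout", "HTTP 503"]
summary = summarize_attempts(outcomes)
print(summary)
```

{% endcode %}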

## Automated Monitoring

### Simple Monitoring Loop

{% code title="monitor\_health.py" %}

```python
import time
from datetime import datetime

def monitor_health(duration_seconds=300, interval_seconds=30):
    """Monitor health for specified duration."""
    start_time = time.time()
    checks = []

    while time.time() - start_time < duration_seconds:
        # Perform health check
        result = client.health.check_connection()

        check_info = {
            'timestamp': datetime.now(),
            'healthy': result.healthy,
            'response_time': result.response_time_ms,
            'status_code': result.status_code
        }

        checks.append(check_info)

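        # Log result (response_time_ms is Optional and may be None on failure)
        status = "✓" if result.healthy else "✗"
        timing = (
            f"{result.response_time_ms:.2f}ms"
            if result.response_time_ms is not None
            else "no response"
        )
        print(f"{status} {check_info['timestamp']}: {timing}")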

        time.sleep(interval_seconds)

    # Calculate statistics
    healthy_count = sum(1 for c in checks if c['healthy'])
    uptime_pct = (healthy_count / len(checks)) * 100 if checks else 0.0

    response_times = [c['response_time'] for c in checks if c['response_time']]
    avg_response = sum(response_times) / len(response_times) if response_times else 0

    print("\nMonitoring Summary:")
    print(f"  Total checks: {len(checks)}")
    print(f"  Healthy checks: {healthy_count}")
    print(f"  Uptime: {uptime_pct:.1f}%")
    print(f"  Average response time: {avg_response:.2f}ms")

    return checks

# Run monitoring
checks = monitor_health(duration_seconds=300, interval_seconds=30)
```

{% endcode %}

### Advanced Monitoring with Alerts

{% code title="health\_monitor.py" %}

```python
import time
from datetime import datetime

class HealthMonitor:
    """Advanced health monitoring with alerts."""

    def __init__(self, client, alert_callback=None):
        self.client = client
        self.alert_callback = alert_callback or self.default_alert
        self.history = []

    def default_alert(self, alert_type, message):
        """Default alert handler."""
        print(f"ALERT [{alert_type}]: {message}")

    def check_health(self):
        """Perform comprehensive health check."""
        result = self.client.health.check_connection()
        auth_result = self.client.health.check_authentication()

        return {
            'timestamp': datetime.now(),
            'connection': result,
            'authentication': auth_result
        }

    def analyze_check(self, check_result):
        """Analyze health check and trigger alerts."""
        conn = check_result['connection']
        auth = check_result['authentication']

        # Check connection
        if not conn.healthy:
            self.alert_callback('CRITICAL', f'Connection failed: {conn.message}')

        # Check authentication
        if not auth.healthy:
            self.alert_callback('CRITICAL', f'Authentication failed: {auth.message}')

        # Check latency
        if conn.response_time_ms and conn.response_time_ms > 1000:
            self.alert_callback('WARNING', f'High latency: {conn.response_time_ms:.2f}ms')

    def monitor(self, duration_seconds=300, interval_seconds=30):
        """Run monitoring loop."""
        start_time = time.time()

        while time.time() - start_time < duration_seconds:
            check_result = self.check_health()
            self.history.append(check_result)
            self.analyze_check(check_result)

            time.sleep(interval_seconds)

    def get_statistics(self):
        """Calculate monitoring statistics."""
        if not self.history:
            return {}

        connection_checks = [h['connection'] for h in self.history]
        healthy_count = sum(1 for c in connection_checks if c.healthy)

        response_times = [
            c.response_time_ms for c in connection_checks
            if c.response_time_ms
        ]

        return {
            'total_checks': len(self.history),
            'healthy_checks': healthy_count,
            'uptime_pct': (healthy_count / len(self.history)) * 100,
            'avg_response_time': sum(response_times) / len(response_times) if response_times else 0,
            'min_response_time': min(response_times) if response_times else 0,
            'max_response_time': max(response_times) if response_times else 0
        }

# Usage
monitor = HealthMonitor(client)

# Run monitoring
monitor.monitor(duration_seconds=300, interval_seconds=30)

# Get statistics
stats = monitor.get_statistics()
print(f"Uptime: {stats['uptime_pct']:.1f}%")
print(f"Average response time: {stats['avg_response_time']:.2f}ms")
```

{% endcode %}
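
A monitor that alerts on every failed check can be noisy when failures are brief. One common refinement is to alert only after several consecutive failures. The sketch below shows that idea in isolation; `FailureDebouncer` is a hypothetical class, not part of the SDK, and it could be called from `analyze_check` with `conn.healthy`.

{% code title="failure\_debouncer.py" %}

```python
class FailureDebouncer:
    """Signal an alert only after `threshold` consecutive unhealthy checks."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one check; return True exactly when the alert should fire."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire once when the threshold is reached, not on each later failure.
        return self.consecutive_failures == self.threshold


debouncer = FailureDebouncer(threshold=3)
outcomes = [True, False, False, True, False, False, False, False]
fired = [debouncer.record(healthy) for healthy in outcomes]
print(fired)
```

{% endcode %}

In this run the first two failures are cleared by a healthy check, so the alert fires only on the third failure of the later streak.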

## Health Reporting

### Generating Health Reports

{% code title="generate\_health\_report.py" %}

```python
from datetime import datetime

def generate_health_report(client):
    """Generate comprehensive health report."""

    report = {
        'timestamp': datetime.now().isoformat(),
        'checks': {}
    }

    # Connection check
    conn_result = client.health.check_connection()
    report['checks']['connection'] = {
        'healthy': conn_result.healthy,
        'response_time_ms': conn_result.response_time_ms,
        'status_code': conn_result.status_code,
        'message': conn_result.message
    }

    # Authentication check
    auth_result = client.health.check_authentication()
    report['checks']['authentication'] = {
        'healthy': auth_result.healthy,
        'response_time_ms': auth_result.response_time_ms,
        'message': auth_result.message
    }

    # Latency measurement
    latency = client.health.measure_latency(num_requests=5)
    report['performance'] = latency

    # Network reliability
    reliability = client.health.test_network_reliability(num_attempts=10)
    report['reliability'] = reliability

    # Overall status
    report['overall_healthy'] = all(
        check['healthy'] for check in report['checks'].values()
    )

    return report

# Generate report
report = generate_health_report(client)

# Display report
print("Health Report")
print("=" * 50)
print(f"Timestamp: {report['timestamp']}")
print(f"Overall Status: {'HEALTHY' if report['overall_healthy'] else 'UNHEALTHY'}")

print("\nChecks:")
for check_name, check_data in report['checks'].items():
    status = "✓" if check_data['healthy'] else "✗"
    print(f"  {status} {check_name}: {check_data['message']}")
    if check_data.get('response_time_ms'):
        print(f"     Response time: {check_data['response_time_ms']:.2f}ms")

print("\nPerformance:")
print(f"  Average latency: {report['performance']['avg_ms']:.2f}ms")
print(f"  Min latency: {report['performance']['min_ms']:.2f}ms")
print(f"  Max latency: {report['performance']['max_ms']:.2f}ms")

print("\nReliability:")
print(f"  Success rate: {report['reliability']['success_rate_pct']:.1f}%")
print(f"  Successful requests: {report['reliability']['successful']}")
print(f"  Failed requests: {report['reliability']['failed']}")
```

{% endcode %}

### Exporting Reports

{% code title="export\_health\_report.py" %}

```python
import json

def export_health_report(report, filename):
    """Export health report to JSON file."""
    with open(filename, 'w') as f:
        json.dump(report, f, indent=2, default=str)

    print(f"Report exported to {filename}")

# Export report
report = generate_health_report(client)
export_health_report(report, 'health_report.json')
```

{% endcode %}
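
For historical analysis it is often more convenient to append each report to a single JSON Lines file than to overwrite one JSON document. The following sketch shows that pattern with the standard library only. `append_health_record` and `read_health_records` are hypothetical helpers, and the records here are stand-ins for the report dict built above.

{% code title="health\_history.py" %}

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_health_record(path, record):
    """Append one health record to the file as a single JSON line."""
    line = json.dumps(record, default=str)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def read_health_records(path):
    """Read all records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the example leaves no files behind
# in the working directory.
history_path = os.path.join(tempfile.mkdtemp(), "health_history.jsonl")
for healthy in (True, False):
    append_health_record(history_path, {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "overall_healthy": healthy,
    })

records = read_health_records(history_path)
print(f"{len(records)} records stored")
```

{% endcode %}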

## Integration Examples

### Flask Integration

{% code title="flask\_integration.py" %}

```python
from flask import Flask, jsonify
from viewai_client import ViewAIClient

app = Flask(__name__)
client = ViewAIClient(api_key="your-api-key")

@app.route('/health')
def health_check():
    """Health check endpoint."""
    result = client.health.check_connection()

    return jsonify({
        'healthy': result.healthy,
        'response_time_ms': result.response_time_ms,
        'message': result.message
    }), 200 if result.healthy else 503

@app.route('/health/detailed')
def detailed_health():
    """Detailed health check endpoint."""
    diagnostics = client.health.run_diagnostics()

    return jsonify({
        'checks': {
            name: {
                'healthy': result.healthy,
                'message': result.message,
                'response_time_ms': result.response_time_ms
            }
            for name, result in diagnostics.items()
        },
        'overall_healthy': all(r.healthy for r in diagnostics.values())
    })

if __name__ == '__main__':
    app.run(port=5000)
```

{% endcode %}

### FastAPI Integration

{% code title="fastapi\_integration.py" %}

```python
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from viewai_client import ViewAIClient

app = FastAPI()
client = ViewAIClient(api_key="your-api-key")

class HealthResponse(BaseModel):
    healthy: bool
    response_time_ms: Optional[float] = None  # Optional on HealthCheckResult
    message: str

@app.get("/health", response_model=HealthResponse)
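def health_check():
    """Health check endpoint.

    Declared with plain `def` because the health call is blocking;
    FastAPI runs such endpoints in a threadpool.
    """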
    result = client.health.check_connection()

    if not result.healthy:
        raise HTTPException(status_code=503, detail=result.message)

    return {
        'healthy': result.healthy,
        'response_time_ms': result.response_time_ms,
        'message': result.message
    }

@app.get("/health/diagnostics")
def diagnostics():
    """Comprehensive diagnostics endpoint (plain `def`: the call is blocking)."""
    results = client.health.run_diagnostics()

    return {
        'checks': {
            name: {
                'healthy': result.healthy,
                'message': result.message,
                'response_time_ms': result.response_time_ms
            }
            for name, result in results.items()
        },
        'overall_healthy': all(r.healthy for r in results.values())
    }
```

{% endcode %}
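
When a load balancer or orchestrator polls `/health` every few seconds, each probe triggers a request to the ViewAI API. A short-lived cache in front of the check is one way to limit that traffic. The sketch below is a minimal time-to-live cache; `CachedCheck` is a hypothetical class, it is not thread-safe as written, and the clock is injectable only to make the example deterministic.

{% code title="cached\_check.py" %}

```python
import time


class CachedCheck:
    """Cache the result of a zero-argument check function for ttl_seconds."""

    def __init__(self, check_fn, ttl_seconds=10.0, clock=time.monotonic):
        self._check_fn = check_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been cached yet

    def get(self):
        """Return the cached result, refreshing it once the TTL has elapsed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._check_fn()
            self._expires_at = now + self._ttl
        return self._value


# Demonstration with a fake clock and a counting stand-in check.
calls = []
fake_now = [0.0]


def fake_check():
    calls.append(1)
    return {"healthy": True}


cached = CachedCheck(fake_check, ttl_seconds=10.0, clock=lambda: fake_now[0])
cached.get()
cached.get()          # served from the cache, no second call
fake_now[0] = 10.0    # advance the fake clock past the TTL
cached.get()          # TTL elapsed, the check runs again
print(f"underlying check ran {len(calls)} times")
```

{% endcode %}

In an endpoint this would be used as `cached = CachedCheck(client.health.check_connection)` followed by `cached.get()` inside the handler.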

### Scheduled Monitoring

{% code title="scheduled\_monitoring.py" %}

```python
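from datetime import datetime

from apscheduler.schedulers.background import BackgroundScheduler
from viewai_client import ViewAIClient

client = ViewAIClient(api_key="your-api-key")

def send_alert(message):
    """Placeholder alert hook; replace with email, Slack, etc."""
    print(f"ALERT: {message}")
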
def scheduled_health_check():
    """Scheduled health check function."""
    result = client.health.check_connection()

    print(f"[{datetime.now()}] Health check: {'OK' if result.healthy else 'FAILED'}")

    if not result.healthy:
        # Send alert (email, Slack, etc.)
        send_alert(f"Health check failed: {result.message}")

# Set up scheduler
scheduler = BackgroundScheduler()
scheduler.add_job(
    scheduled_health_check,
    'interval',
    minutes=5,  # Run every 5 minutes
    id='health_check'
)

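# start() returns immediately; the host application must keep running
# for the scheduled job to fire.
scheduler.start()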
```

{% endcode %}

## Best Practices

{% stepper %}
{% step %}

### Regular Monitoring

Check health at regular intervals (every 5-10 minutes)
{% endstep %}

{% step %}

### Multiple Metrics

Monitor connection, authentication, and performance
{% endstep %}

{% step %}

### Alerting

Set up alerts for failures and performance degradation
{% endstep %}

{% step %}

### Thresholds

Define clear thresholds for warnings and critical alerts
{% endstep %}

{% step %}

### Logging

Log all health check results for historical analysis
{% endstep %}

{% step %}

### Response Times

Monitor response times to detect performance issues early
{% endstep %}

{% step %}

### Network Reliability

Track success rates to identify intermittent issues
{% endstep %}

{% step %}

### Comprehensive Checks

Use `client.health.run_diagnostics()` for thorough testing
{% endstep %}

{% step %}

### Automated Recovery

Implement retry logic for transient failures
{% endstep %}

{% step %}

### Documentation

Document health check procedures and thresholds
{% endstep %}
{% endstepper %}
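
The retry logic recommended under Automated Recovery can be sketched as a small wrapper that repeats a check with exponential backoff before treating it as failed. `check_with_retry` is a hypothetical helper, not part of the SDK. It assumes only that the check returns an object with a `healthy` attribute, and the `sleep` parameter exists so the example can run without waiting.

{% code title="check\_with\_retry.py" %}

```python
import time
from types import SimpleNamespace


def check_with_retry(check_fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call check_fn until it reports healthy or attempts run out.

    Returns the last result. Delays double each time: base_delay, 2x, 4x, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(attempts):
        result = check_fn()
        if result.healthy:
            return result
        if attempt < attempts - 1:  # no delay after the final attempt
            sleep(base_delay * (2 ** attempt))
    return result


# Stand-in check that fails twice, then succeeds.
outcomes = iter([False, False, True])
delays = []
result = check_with_retry(
    lambda: SimpleNamespace(healthy=next(outcomes)),
    attempts=3,
    base_delay=0.5,
    sleep=delays.append,  # record delays instead of actually sleeping
)
print(result.healthy, delays)
```

{% endcode %}

With the real client this would be called as `check_with_retry(client.health.check_connection)`, and an alert would be raised only if the returned result is still unhealthy.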

## Monitoring Strategies

### Development Environment

{% code title="dev\_monitoring.py" %}

```python
# Frequent checks with verbose output
def dev_monitoring():
    """Development environment monitoring."""
    diagnostics = client.health.run_diagnostics()

    for check_name, result in diagnostics.items():
        print(f"{check_name}: {result.message}")
        if result.details:
            print(f"  Details: {result.details}")
```

{% endcode %}

### Staging Environment

{% code title="staging\_monitoring.py" %}

```python
# Regular checks with alerting
# send_alert and log_health_status are placeholders for your own hooks
def staging_monitoring():
    """Staging environment monitoring."""
    result = client.health.check_connection()
    auth_result = client.health.check_authentication()

    if not result.healthy or not auth_result.healthy:
        send_alert("Staging health check failed")

    log_health_status(result, auth_result)
```

{% endcode %}

### Production Environment

{% code title="production\_monitoring.py" %}

```python
# Comprehensive monitoring with metrics
# send_metrics and trigger_incident are placeholders for your own hooks
def production_monitoring():
    """Production environment monitoring."""
    # Run comprehensive diagnostics
    diagnostics = client.health.run_diagnostics()

    # Measure performance
    latency = client.health.measure_latency(num_requests=10)

    # Test reliability
    reliability = client.health.test_network_reliability(num_attempts=20)

    # Send metrics to monitoring system
    send_metrics({
        'diagnostics': diagnostics,
        'latency': latency,
        'reliability': reliability
    })

    # Alert on failures
    if not all(r.healthy for r in diagnostics.values()):
        trigger_incident("ViewAI health check failed")
```

{% endcode %}

## Troubleshooting

### Connection Failures

{% stepper %}
{% step %}

### Check connection status

```python
result = client.health.check_connection()

if not result.healthy:
    print("Troubleshooting connection failure:")
    print(f"1. Check API endpoint: {client.http_client.base_url}")
    print("2. Verify network connectivity")
    print("3. Check firewall rules")
    print(f"4. Review error: {result.message}")
```

{% endstep %}
{% endstepper %}

### Authentication Failures

{% stepper %}
{% step %}

### Verify authentication

```python
auth_result = client.health.check_authentication()

if not auth_result.healthy:
    print("Troubleshooting authentication failure:")
    print("1. Verify API key is correct")
    print("2. Check API key expiration")
    print("3. Confirm API key permissions")
    print(f"4. Review error: {auth_result.message}")
```

{% endstep %}
{% endstepper %}

### High Latency

{% stepper %}
{% step %}

### Diagnose high latency

```python
latency = client.health.measure_latency(num_requests=10)

if latency['avg_ms'] > 500:
    print("High latency detected:")
    print(f"Average: {latency['avg_ms']:.2f}ms")
    print(f"Max: {latency['max_ms']:.2f}ms")
    print("Recommendations:")
    print("1. Check network connection")
    print("2. Verify server load")
    print("3. Consider geographic proximity")
    print("4. Review timeout settings")
```

{% endstep %}
{% endstepper %}

## Next Steps

* Learn about Model Registry
* Explore Version Management
* Review Best Practices

