calypso/docs/alpha/srs/SRS-09-Monitoring-Alerting.md
2026-01-04 13:19:40 +07:00

# SRS-09: Monitoring & Alerting
## 1. Overview
The Monitoring & Alerting module provides real-time system monitoring, metrics collection, alert management, and system health tracking.
## 2. Functional Requirements
### 2.1 System Metrics
**FR-MON-001**: System shall collect and display CPU metrics
- **Output**: CPU usage percentage, load average
- **Refresh**: Every 5 seconds

**FR-MON-002**: System shall collect and display memory metrics
- **Output**: Total memory, used memory, available memory, usage percentage
- **Refresh**: Every 5 seconds

**FR-MON-003**: System shall collect and display storage metrics
- **Output**: Total capacity, used capacity, available capacity, usage percentage
- **Refresh**: Every 5 seconds

**FR-MON-004**: System shall collect and display network throughput
- **Output**: Inbound/outbound throughput, historical data
- **Refresh**: Every 5 seconds

**FR-MON-005**: System shall display ZFS ARC statistics
- **Output**: ARC hit ratio, cache size, eviction statistics
- **Refresh**: Real-time
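
The ARC hit ratio in FR-MON-005 can be derived from the hit and miss counters that OpenZFS exposes on Linux in `/proc/spl/kstat/zfs/arcstats`. The following is a minimal sketch, assuming the collector reads that file as text; the function names and the embedded sample are illustrative and are not part of this specification.

```python
def parse_arcstats(text: str) -> dict[str, int]:
    """Parse kstat-style 'name type data' rows into a name -> value mapping."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields and a numeric value;
        # header rows fail one of these checks and are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio(stats: dict[str, int]) -> float:
    """Return hits / (hits + misses) as a percentage, or 0.0 with no traffic."""
    hits = stats.get("hits", 0)
    misses = stats.get("misses", 0)
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


# Illustrative sample only; a real collector would read the kstat file.
SAMPLE = """\
name                            type data
hits                            4    900
misses                          4    100
size                            4    4294967296
"""

stats = parse_arcstats(SAMPLE)
print(f"ARC hit ratio: {arc_hit_ratio(stats):.1f}%")  # prints "ARC hit ratio: 90.0%"
```

The counters are cumulative since module load, so a dashboard that wants a recent hit ratio would likely compute it from the difference between two successive samples.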
### 2.2 ZFS Health Monitoring
**FR-MON-006**: System shall display ZFS pool health
- **Output**: Pool status, health indicators, errors

**FR-MON-007**: System shall display ZFS dataset health
- **Output**: Dataset status, quota usage, compression ratio
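
One possible source for the pool status in FR-MON-006 is the output of `zpool list -H -o name,health`, which prints one tab-separated row per pool without a header. The sketch below parses such output; it assumes the backend shells out to the ZFS CLI, which this document does not state, and the helper names are hypothetical.

```python
def parse_pool_health(output: str) -> dict[str, str]:
    """Map pool name -> health from `zpool list -H -o name,health` output."""
    pools = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, health = line.split("\t")
        pools[name] = health
    return pools


def pools_needing_attention(pools: dict[str, str]) -> list[str]:
    """Return the names of pools whose health is anything other than ONLINE."""
    return sorted(name for name, health in pools.items() if health != "ONLINE")


# Illustrative sample; pool names are made up.
SAMPLE_OUTPUT = "tank\tONLINE\nbackup\tDEGRADED\n"

pool_health = parse_pool_health(SAMPLE_OUTPUT)
print(pools_needing_attention(pool_health))  # prints ['backup']
```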
### 2.3 System Logs
**FR-MON-008**: System shall display system logs
- **Output**: Log entries with timestamp, level, source, message
- **Filtering**: By level, time range, search
- **Refresh**: Every 10 minutes

**FR-MON-009**: System shall allow users to search logs
- **Input**: Search query
- **Output**: Filtered log entries
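
The filtering in FR-MON-008 and the search in FR-MON-009 can be sketched as a single function over log entries. The `LogEntry` fields mirror the output listed above; the case-insensitive substring match on the message and the inclusive time bounds are assumptions, since the requirements do not define search semantics.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    level: str
    source: str
    message: str


def filter_logs(
    entries: Iterable[LogEntry],
    level: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    query: Optional[str] = None,
) -> list[LogEntry]:
    """Keep entries that match every filter that was supplied."""
    needle = query.lower() if query else None
    result = []
    for entry in entries:
        if level and entry.level != level:
            continue
        if start and entry.timestamp < start:
            continue
        if end and entry.timestamp > end:
            continue
        # Assumed semantics: case-insensitive substring match on the message.
        if needle and needle not in entry.message.lower():
            continue
        result.append(entry)
    return result


# Illustrative entries only.
entries = [
    LogEntry(datetime(2026, 1, 4, 13, 0), "info", "zfs", "Scrub started on pool tank"),
    LogEntry(datetime(2026, 1, 4, 13, 5), "error", "network", "Link down on eth0"),
    LogEntry(datetime(2026, 1, 4, 13, 10), "error", "zfs", "Checksum error on pool tank"),
]

for entry in filter_logs(entries, level="error", query="POOL"):
    print(entry.timestamp.isoformat(), entry.source, entry.message)
```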
### 2.4 Active Jobs
**FR-MON-010**: System shall display active jobs
- **Output**: Job list with type, status, progress, start time

**FR-MON-011**: System shall allow users to view job details
- **Output**: Job configuration, progress, logs
### 2.5 Alert Management
**FR-MON-012**: System shall display active alerts
- **Output**: Alert list with severity, source, message, timestamp

**FR-MON-013**: System shall allow users to acknowledge alerts
- **Input**: Alert ID
- **Action**: Mark alert as acknowledged

**FR-MON-014**: System shall allow users to resolve alerts
- **Input**: Alert ID
- **Action**: Mark alert as resolved

**FR-MON-015**: System shall display alert history
- **Output**: Historical alerts with status, resolution

**FR-MON-016**: System shall allow users to configure alert rules
- **Input**: Rule name, condition, severity, enabled flag
- **Output**: Created alert rule

**FR-MON-017**: System shall evaluate alert rules
- **Action**: Automatic evaluation based on metrics
- **Output**: Generated alerts when conditions met
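
The rule inputs in FR-MON-016 and the evaluation step in FR-MON-017 can be illustrated with a small sketch. It assumes a rule condition is a single comparison of one metric against a threshold; the actual condition format is not defined in this document, and the metric names, severity labels, and class names below are hypothetical.

```python
import operator
from dataclasses import dataclass

# Comparison operators a condition may use (an assumed, minimal set).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str       # hypothetical metric key, e.g. "cpu_usage_percent"
    op: str           # one of the keys in OPERATORS
    threshold: float
    severity: str
    enabled: bool = True


@dataclass(frozen=True)
class Alert:
    severity: str
    source: str
    message: str


def evaluate_rules(rules: list[AlertRule], metrics: dict[str, float]) -> list[Alert]:
    """Return one alert for each enabled rule whose condition currently holds."""
    alerts = []
    for rule in rules:
        if not rule.enabled:
            continue
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not collected in this cycle; nothing to compare
        if OPERATORS[rule.op](value, rule.threshold):
            message = f"{rule.metric} is {value} (threshold {rule.op} {rule.threshold})"
            alerts.append(Alert(rule.severity, rule.name, message))
    return alerts


rules = [
    AlertRule("High CPU", "cpu_usage_percent", ">", 90.0, "critical"),
    AlertRule("Low ARC hit ratio", "arc_hit_ratio_percent", "<", 80.0, "warning", enabled=False),
]
metrics = {"cpu_usage_percent": 95.0, "arc_hit_ratio_percent": 50.0}

for alert in evaluate_rules(rules, metrics):
    print(alert.severity, alert.source, alert.message)
```

In this sketch the disabled rule produces no alert even though its condition holds, which is one reading of the enabled flag in FR-MON-016. A production evaluator would probably also deduplicate alerts that are already active for the same rule.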
### 2.6 Health Checks
**FR-MON-018**: System shall perform health checks
- **Output**: Overall system health status (healthy/degraded/unhealthy)

**FR-MON-019**: System shall display health check details
- **Output**: Component health status, issues, recommendations
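
One way to derive the overall status in FR-MON-018 from the component statuses in FR-MON-019 is to report the worst component. This aggregation rule is an assumption, as is treating an empty component set as healthy; the document does not specify either, and the component names are made up.

```python
# Ordering of the three statuses named in FR-MON-018, from best to worst.
SEVERITY_ORDER = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def overall_health(components: dict[str, str]) -> str:
    """Return the worst status among components (assumed aggregation rule)."""
    if not components:
        return "healthy"  # assumption: nothing to check means nothing is wrong
    return max(components.values(), key=SEVERITY_ORDER.__getitem__)


components = {"zfs": "healthy", "network": "degraded", "database": "healthy"}
print(overall_health(components))  # prints "degraded"
```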
## 3. User Interface Requirements
### 3.1 Monitoring Dashboard
- Metrics cards (CPU, Memory, Storage, Network)
- Real-time charts (Network Throughput, ZFS ARC Hit Ratio)
- System health indicators
### 3.2 Tabs
- **Active Jobs**: Running jobs list
- **System Logs**: Log viewer with filtering
- **Alerts History**: Alert list and management
### 3.3 Alert Management
- Alert list with severity indicators
- Alert detail view
- Alert acknowledgment and resolution
## 4. API Endpoints
```
GET /api/v1/monitoring/metrics
GET /api/v1/monitoring/health
GET /api/v1/monitoring/alerts
GET /api/v1/monitoring/alerts/:id
POST /api/v1/monitoring/alerts/:id/acknowledge
POST /api/v1/monitoring/alerts/:id/resolve
GET /api/v1/monitoring/rules
POST /api/v1/monitoring/rules
PUT /api/v1/monitoring/rules/:id
DELETE /api/v1/monitoring/rules/:id
GET /api/v1/system/logs
GET /api/v1/system/network/throughput
```
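
As an illustration of how a client might call the acknowledge endpoint listed above, the sketch below builds the request with the Python standard library without sending it. The host name, the bearer-token header, and the empty request body are assumptions; this document does not describe authentication or payloads.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://calypso.example"  # hypothetical host


def acknowledge_request(alert_id: str, token: str) -> urllib.request.Request:
    """Build a POST request for /api/v1/monitoring/alerts/:id/acknowledge."""
    # Percent-encode the ID so it cannot alter the path structure.
    safe_id = urllib.parse.quote(alert_id, safe="")
    url = f"{BASE_URL}/api/v1/monitoring/alerts/{safe_id}/acknowledge"
    return urllib.request.Request(
        url,
        data=b"",  # assumed: no request body is needed
        method="POST",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )


req = acknowledge_request("42", "example-token")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```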
## 5. Permissions
- **monitoring:read**: Required for viewing metrics, alerts, logs
- **monitoring:write**: Required for acknowledging/resolving alerts, configuring rules
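
For the endpoints listed in section 4, the two permissions line up with the HTTP method: every `GET` is a view operation, and every `POST`, `PUT`, or `DELETE` acknowledges or resolves an alert or changes a rule. A minimal sketch of a check built on that observation follows; the function names are hypothetical, and whether `monitoring:write` implies `monitoring:read` is not specified, so the sketch treats them as independent.

```python
WRITE_METHODS = {"POST", "PUT", "DELETE"}


def required_permission(method: str) -> str:
    """Map an HTTP method on a monitoring endpoint to the permission it needs."""
    if method.upper() in WRITE_METHODS:
        return "monitoring:write"
    return "monitoring:read"


def is_allowed(method: str, granted: set[str]) -> bool:
    """Check whether the caller's granted permissions cover the request."""
    return required_permission(method) in granted


viewer = {"monitoring:read"}
print(is_allowed("GET", viewer))   # prints True
print(is_allowed("POST", viewer))  # prints False
```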
## 6. Error Handling
- Metrics collection failures
- Alert rule evaluation errors
- Log access errors
- Insufficient permissions