SRS-09: Monitoring & Alerting
1. Overview
The Monitoring & Alerting module provides real-time system monitoring, metrics collection, alert management, and system health tracking.
2. Functional Requirements
2.1 System Metrics
FR-MON-001: System shall collect and display CPU metrics
- Output: CPU usage percentage, load average
- Refresh: Every 5 seconds
FR-MON-002: System shall collect and display memory metrics
- Output: Total memory, used memory, available memory, usage percentage
- Refresh: Every 5 seconds
FR-MON-003: System shall collect and display storage metrics
- Output: Total capacity, used capacity, available capacity, usage percentage
- Refresh: Every 5 seconds
FR-MON-004: System shall collect and display network throughput
- Output: Inbound/outbound throughput, historical data
- Refresh: Every 5 seconds
FR-MON-005: System shall display ZFS ARC statistics
- Output: ARC hit ratio, cache size, eviction statistics
- Refresh: Real-time
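The percentage outputs in FR-MON-002, FR-MON-003 and FR-MON-005 are derived from raw counters. The following is a minimal sketch of how they might be computed. The helper names `usage_percent` and `arc_hit_ratio` are hypothetical, and the sketch assumes the ARC hit ratio is defined as hits divided by total lookups (hits plus misses).

```python
def usage_percent(used: int, total: int) -> float:
    """Return used/total as a percentage; 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def arc_hit_ratio(hits: int, misses: int) -> float:
    """Return the ARC hit ratio as a percentage of all lookups.

    Assumption: ratio = hits / (hits + misses). Returns 0.0 when no
    lookups have been recorded, to avoid dividing by zero.
    """
    lookups = hits + misses
    if lookups == 0:
        return 0.0
    return 100.0 * hits / lookups
```

Guarding the zero-denominator case matters here because a freshly booted system may report no ARC lookups yet.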
2.2 ZFS Health Monitoring
FR-MON-006: System shall display ZFS pool health
- Output: Pool status, health indicators, errors
FR-MON-007: System shall display ZFS dataset health
- Output: Dataset status, quota usage, compression ratio
2.3 System Logs
FR-MON-008: System shall display system logs
- Output: Log entries with timestamp, level, source, message
- Filtering: By level, time range, search
- Refresh: Every 10 minutes
FR-MON-009: System shall allow users to search logs
- Input: Search query
- Output: Filtered log entries
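The filtering in FR-MON-008 and the search in FR-MON-009 can be sketched as a single function over in-memory entries. This is an illustration only: the `LogEntry` type and `filter_logs` function are hypothetical names, and the sketch assumes a case-insensitive substring match against the message and source fields.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class LogEntry:
    timestamp: datetime
    level: str
    source: str
    message: str


def filter_logs(
    entries: Iterable[LogEntry],
    level: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    query: Optional[str] = None,
) -> List[LogEntry]:
    """Return entries matching every filter that was supplied."""
    needle = query.lower() if query else None
    result = []
    for entry in entries:
        # Level filter: exact match, ignoring case.
        if level is not None and entry.level.lower() != level.lower():
            continue
        # Time range filter: both bounds are inclusive.
        if start is not None and entry.timestamp < start:
            continue
        if end is not None and entry.timestamp > end:
            continue
        # Search filter: substring match on message or source.
        if needle is not None:
            haystack = (entry.message + " " + entry.source).lower()
            if needle not in haystack:
                continue
        result.append(entry)
    return result
```

Omitted filters are treated as "match all", so the same function serves both the log viewer and the search box.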
2.4 Active Jobs
FR-MON-010: System shall display active jobs
- Output: Job list with type, status, progress, start time
FR-MON-011: System shall allow users to view job details
- Output: Job configuration, progress, logs
2.5 Alert Management
FR-MON-012: System shall display active alerts
- Output: Alert list with severity, source, message, timestamp
FR-MON-013: System shall allow users to acknowledge alerts
- Input: Alert ID
- Action: Mark alert as acknowledged
FR-MON-014: System shall allow users to resolve alerts
- Input: Alert ID
- Action: Mark alert as resolved
FR-MON-015: System shall display alert history
- Output: Historical alerts with status, resolution
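FR-MON-013 through FR-MON-015 imply that an alert moves through a small set of states. The sketch below models one possible lifecycle. The state names and the allowed transitions are assumptions, not taken from this specification: it assumes an alert can be resolved with or without prior acknowledgment, and that a resolved alert is final.

```python
from enum import Enum


class AlertStatus(Enum):
    ACTIVE = "active"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    AlertStatus.ACTIVE: {AlertStatus.ACKNOWLEDGED, AlertStatus.RESOLVED},
    AlertStatus.ACKNOWLEDGED: {AlertStatus.RESOLVED},
    AlertStatus.RESOLVED: set(),
}


def transition(current: AlertStatus, target: AlertStatus) -> AlertStatus:
    """Return the new status, or raise ValueError for a disallowed move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(
            f"cannot move alert from {current.value} to {target.value}"
        )
    return target
```

Keeping the transitions in a table makes it straightforward to adjust the lifecycle if, for example, reopening resolved alerts is later required.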
FR-MON-016: System shall allow users to configure alert rules
- Input: Rule name, condition, severity, enabled flag
- Output: Created alert rule
FR-MON-017: System shall evaluate alert rules
- Action: Automatic evaluation based on metrics
- Output: Generated alerts when conditions met
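The rule evaluation in FR-MON-017 can be illustrated with a simple threshold check. This is a sketch under stated assumptions: the specification only says a rule has a name, condition, severity, and enabled flag, so modelling the condition as a metric name, comparison operator, and threshold is hypothetical, as are the names `AlertRule` and `evaluate_rules`.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Supported comparison operators for the assumed condition format.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class AlertRule:
    name: str
    metric: str       # assumed: key into the current metrics snapshot
    op: str           # assumed: one of the keys in OPERATORS
    threshold: float
    severity: str
    enabled: bool = True


def evaluate_rules(rules: List[AlertRule], metrics: Dict[str, float]) -> List[dict]:
    """Return one alert dict for each enabled rule whose condition holds."""
    alerts = []
    for rule in rules:
        if not rule.enabled:
            continue
        value = metrics.get(rule.metric)
        if value is None:
            # Metric missing from this snapshot: skip rather than alert.
            continue
        if OPERATORS[rule.op](value, rule.threshold):
            alerts.append({
                "rule": rule.name,
                "severity": rule.severity,
                "message": f"{rule.metric} {rule.op} {rule.threshold} (current: {value})",
            })
    return alerts
```

A rule such as `AlertRule("High CPU", "cpu_usage_percent", ">", 90.0, "critical")` would then fire whenever the collected CPU usage exceeds 90 percent.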
2.6 Health Checks
FR-MON-018: System shall perform health checks
- Output: Overall system health status (healthy/degraded/unhealthy)
FR-MON-019: System shall display health check details
- Output: Component health status, issues, recommendations
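One way to derive the overall status in FR-MON-018 from the component statuses in FR-MON-019 is to report the worst component status. The sketch below assumes that policy, which is not stated in this specification. The function name `overall_health` is hypothetical, and the sketch treats an empty component set as healthy.

```python
from typing import Dict

# Severity ranking of the three statuses named in FR-MON-018.
HEALTH_RANK = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def overall_health(components: Dict[str, str]) -> str:
    """Return the worst status among all components.

    Assumption: with no components reported, the system is healthy.
    """
    if not components:
        return "healthy"
    return max(components.values(), key=lambda status: HEALTH_RANK[status])
```

A worst-of policy is conservative: a single unhealthy component marks the whole system unhealthy, which may or may not match the intended behavior.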
3. User Interface Requirements
3.1 Monitoring Dashboard
- Metrics cards (CPU, Memory, Storage, Network)
- Real-time charts (Network Throughput, ZFS ARC Hit Ratio)
- System health indicators
3.2 Tabs
- Active Jobs: Running jobs list
- System Logs: Log viewer with filtering
- Alerts History: Alert list and management
3.3 Alert Management
- Alert list with severity indicators
- Alert detail view
- Alert acknowledgment and resolution
4. API Endpoints
GET /api/v1/monitoring/metrics
GET /api/v1/monitoring/health
GET /api/v1/monitoring/alerts
GET /api/v1/monitoring/alerts/:id
POST /api/v1/monitoring/alerts/:id/acknowledge
POST /api/v1/monitoring/alerts/:id/resolve
GET /api/v1/monitoring/rules
POST /api/v1/monitoring/rules
PUT /api/v1/monitoring/rules/:id
DELETE /api/v1/monitoring/rules/:id
GET /api/v1/system/logs
GET /api/v1/system/network/throughput
5. Permissions
- monitoring:read: Required for viewing metrics, alerts, logs
- monitoring:write: Required for acknowledging/resolving alerts, configuring rules
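The two permissions line up with the HTTP methods in Section 4: every GET endpoint is a view, and every POST, PUT, or DELETE endpoint acknowledges, resolves, or configures something. A minimal sketch of that mapping follows. The function names are hypothetical, and the sketch assumes monitoring:write does not implicitly grant monitoring:read, which this specification does not state either way.

```python
from typing import Set

READ = "monitoring:read"
WRITE = "monitoring:write"


def required_permission(method: str) -> str:
    """Map an HTTP method to the permission assumed to guard it."""
    return READ if method.upper() == "GET" else WRITE


def is_allowed(method: str, granted: Set[str]) -> bool:
    """Return True when the caller holds the permission the method needs."""
    return required_permission(method) in granted
```

For example, a caller holding only monitoring:read could list alerts via GET /api/v1/monitoring/alerts but would be refused on POST /api/v1/monitoring/alerts/:id/acknowledge.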
6. Error Handling
- Metrics collection failures
- Alert rule evaluation errors
- Log access errors
- Insufficient permissions