# SRS-09: Monitoring & Alerting

## 1. Overview
The Monitoring & Alerting module provides real-time system monitoring, metrics collection, alert management, and system health tracking.
## 2. Functional Requirements

### 2.1 System Metrics
**FR-MON-001**: System shall collect and display CPU metrics
- **Output**: CPU usage percentage, load average
- **Refresh**: Every 5 seconds

**FR-MON-002**: System shall collect and display memory metrics
- **Output**: Total memory, used memory, available memory, usage percentage
- **Refresh**: Every 5 seconds

**FR-MON-003**: System shall collect and display storage metrics
- **Output**: Total capacity, used capacity, available capacity, usage percentage
- **Refresh**: Every 5 seconds

**FR-MON-004**: System shall collect and display network throughput
- **Output**: Inbound/outbound throughput, historical data
- **Refresh**: Every 5 seconds

**FR-MON-005**: System shall display ZFS ARC statistics
- **Output**: ARC hit ratio, cache size, eviction statistics
- **Refresh**: Real-time
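
FR-MON-005 does not say how the ARC hit ratio is derived. The following is a minimal sketch, assuming OpenZFS on Linux, where ARC counters are exposed as `name type data` rows in `/proc/spl/kstat/zfs/arcstats`. The function names and the sample text are illustrative and not part of this specification.

```python
def parse_arcstats(text: str) -> dict[str, int]:
    """Parse kstat-style rows of the form `name type data` into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields and a numeric value;
        # header rows do not match and are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio(stats: dict[str, int]) -> float:
    """Return hits / (hits + misses) as a percentage, or 0.0 with no traffic."""
    hits = stats.get("hits", 0)
    misses = stats.get("misses", 0)
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


# Illustrative sample, not real measurements.
SAMPLE = """\
name                            type data
hits                            4    900
misses                          4    100
size                            4    2147483648
"""

stats = parse_arcstats(SAMPLE)
print(f"ARC hit ratio: {arc_hit_ratio(stats):.1f}%")
```

In a real collector the text would be read from the kstat file on each refresh; the ratio is computed from cumulative counters, so a per-interval ratio would need the difference between two consecutive samples.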
### 2.2 ZFS Health Monitoring
**FR-MON-006**: System shall display ZFS pool health
- **Output**: Pool status, health indicators, errors

**FR-MON-007**: System shall display ZFS dataset health
- **Output**: Dataset status, quota usage, compression ratio
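
The specification does not name a data source for pool health. One possible approach, sketched here as an assumption, is to parse the tab-separated output of `zpool list -H -o name,health` and flag every pool that is not `ONLINE`. The function names and the sample output are hypothetical.

```python
def parse_pool_health(output: str) -> dict[str, str]:
    """Map pool name to health state from `zpool list -H -o name,health` output."""
    pools: dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # With -H, fields are separated by a single tab and there is no header.
        name, health = line.split("\t")
        pools[name] = health
    return pools


def unhealthy_pools(pools: dict[str, str]) -> list[str]:
    """Return the names of pools whose health is anything other than ONLINE."""
    return sorted(name for name, health in pools.items() if health != "ONLINE")


# Illustrative output for two hypothetical pools.
SAMPLE_OUTPUT = "tank\tONLINE\nbackup\tDEGRADED\n"

pools = parse_pool_health(SAMPLE_OUTPUT)
print(unhealthy_pools(pools))
```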
### 2.3 System Logs
**FR-MON-008**: System shall display system logs
- **Output**: Log entries with timestamp, level, source, message
- **Filtering**: By level, time range, and search query
- **Refresh**: Every 10 minutes

**FR-MON-009**: System shall allow users to search logs
- **Input**: Search query
- **Output**: Filtered log entries
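
The filtering described in FR-MON-008 and FR-MON-009 can be sketched as follows. The `LogEntry` fields mirror the output fields listed above; the `filter_logs` signature, the case-insensitive substring match, and the sample entries are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    level: str
    source: str
    message: str


def filter_logs(
    entries: list[LogEntry],
    level: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    query: Optional[str] = None,
) -> list[LogEntry]:
    """Return entries matching every filter that is set (filters are ANDed)."""
    needle = query.lower() if query else None
    result = []
    for entry in entries:
        if level is not None and entry.level != level:
            continue
        if start is not None and entry.timestamp < start:
            continue
        if end is not None and entry.timestamp > end:
            continue
        # Assumed search semantics: case-insensitive match on message or source.
        if needle and needle not in entry.message.lower() \
                and needle not in entry.source.lower():
            continue
        result.append(entry)
    return result


# Illustrative entries, not real log data.
entries = [
    LogEntry(datetime(2024, 1, 1, 12, 0), "error", "zfs", "Pool tank is degraded"),
    LogEntry(datetime(2024, 1, 1, 12, 5), "info", "scheduler", "Snapshot job finished"),
    LogEntry(datetime(2024, 1, 1, 13, 0), "error", "network", "Link down on eth0"),
]

early_errors = filter_logs(entries, level="error", end=datetime(2024, 1, 1, 12, 30))
print([e.message for e in early_errors])
```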
### 2.4 Active Jobs
**FR-MON-010**: System shall display active jobs
- **Output**: Job list with type, status, progress, start time

**FR-MON-011**: System shall allow users to view job details
- **Output**: Job configuration, progress, logs
### 2.5 Alert Management
**FR-MON-012**: System shall display active alerts
- **Output**: Alert list with severity, source, message, timestamp

**FR-MON-013**: System shall allow users to acknowledge alerts
- **Input**: Alert ID
- **Action**: Mark alert as acknowledged

**FR-MON-014**: System shall allow users to resolve alerts
- **Input**: Alert ID
- **Action**: Mark alert as resolved

**FR-MON-015**: System shall display alert history
- **Output**: Historical alerts with status, resolution

**FR-MON-016**: System shall allow users to configure alert rules
- **Input**: Rule name, condition, severity, enabled flag
- **Output**: Created alert rule

**FR-MON-017**: System shall evaluate alert rules
- **Action**: Automatic evaluation based on metrics
- **Output**: Generated alerts when conditions met
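
Rule evaluation (FR-MON-017) and the acknowledge/resolve actions (FR-MON-013, FR-MON-014) can be illustrated with a small sketch. The specification leaves the format of a rule "condition" open, so modelling it as a metric name, a comparison operator, and a threshold is an assumption, as are the metric names, severity labels, and status strings used here.

```python
import operator
from dataclasses import dataclass

# Assumed set of comparison operators a condition may use.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class AlertRule:
    name: str
    metric: str        # hypothetical metric key, e.g. "cpu_usage_percent"
    op: str            # one of the keys in OPS
    threshold: float
    severity: str
    enabled: bool = True


@dataclass
class Alert:
    rule: str
    severity: str
    message: str
    status: str = "active"


def evaluate(rules: list[AlertRule], metrics: dict[str, float]) -> list[Alert]:
    """Generate one alert per enabled rule whose condition holds."""
    alerts = []
    for rule in rules:
        if not rule.enabled or rule.metric not in metrics:
            continue
        value = metrics[rule.metric]
        if OPS[rule.op](value, rule.threshold):
            message = f"{rule.metric}={value} {rule.op} {rule.threshold}"
            alerts.append(Alert(rule.name, rule.severity, message))
    return alerts


def acknowledge(alert: Alert) -> None:
    """Mark an active alert as acknowledged; other states are left unchanged."""
    if alert.status == "active":
        alert.status = "acknowledged"


def resolve(alert: Alert) -> None:
    """Mark an alert as resolved."""
    alert.status = "resolved"


rules = [
    AlertRule("High CPU", "cpu_usage_percent", ">", 90.0, "critical"),
    AlertRule("Low ARC hit ratio", "arc_hit_ratio", "<", 50.0, "warning", enabled=False),
]
alerts = evaluate(rules, {"cpu_usage_percent": 93.5, "arc_hit_ratio": 40.0})
acknowledge(alerts[0])
print([(a.rule, a.status) for a in alerts])
```

The disabled rule produces no alert even though its condition holds, which matches the enabled flag listed as a rule input in FR-MON-016.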
### 2.6 Health Checks
**FR-MON-018**: System shall perform health checks
- **Output**: Overall system health status (healthy/degraded/unhealthy)

**FR-MON-019**: System shall display health check details
- **Output**: Component health status, issues, recommendations
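
The specification does not define how component results combine into the overall status. A minimal sketch, assuming a "worst component wins" policy; the component names are hypothetical.

```python
from enum import IntEnum


class Health(IntEnum):
    """Health states from FR-MON-018, ordered from best to worst."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def overall_health(components: dict[str, Health]) -> Health:
    """Return the worst state among components; HEALTHY when none are reported."""
    return max(components.values(), default=Health.HEALTHY)


components = {
    "zfs_pools": Health.DEGRADED,
    "network": Health.HEALTHY,
    "storage": Health.HEALTHY,
}
print(overall_health(components).name.lower())
```

Other policies are equally plausible, for example treating only a subset of components as critical, so this aggregation rule should be confirmed before it is relied on.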
## 3. User Interface Requirements

### 3.1 Monitoring Dashboard
- Metrics cards (CPU, Memory, Storage, Network)
- Real-time charts (Network Throughput, ZFS ARC Hit Ratio)
- System health indicators
### 3.2 Tabs
- **Active Jobs**: Running jobs list
- **System Logs**: Log viewer with filtering
- **Alerts History**: Alert list and management
### 3.3 Alert Management
- Alert list with severity indicators
- Alert detail view
- Alert acknowledgment and resolution
## 4. API Endpoints

```
GET /api/v1/monitoring/metrics
GET /api/v1/monitoring/health
GET /api/v1/monitoring/alerts
GET /api/v1/monitoring/alerts/:id
POST /api/v1/monitoring/alerts/:id/acknowledge
POST /api/v1/monitoring/alerts/:id/resolve
GET /api/v1/monitoring/rules
POST /api/v1/monitoring/rules
PUT /api/v1/monitoring/rules/:id
DELETE /api/v1/monitoring/rules/:id

GET /api/v1/system/logs
GET /api/v1/system/network/throughput
```
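
As a usage illustration, a client could build the acknowledge request as sketched below. The host name, the bearer-token authentication scheme, and the empty request body are assumptions; the specification defines only the method and path.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://nas.example.com"  # hypothetical host


def build_ack_request(alert_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that acknowledges one alert."""
    # Percent-encode the ID so it is safe to place in the URL path.
    safe_id = urllib.parse.quote(alert_id, safe="")
    url = f"{BASE_URL}/api/v1/monitoring/alerts/{safe_id}/acknowledge"
    return urllib.request.Request(
        url,
        data=b"",  # assumed: no request body is required
        method="POST",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )


req = build_ack_request("42", "example-token")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is omitted here because response formats are not defined in this document.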
## 5. Permissions
- **monitoring:read**: Required for viewing metrics, alerts, and logs
- **monitoring:write**: Required for acknowledging or resolving alerts and for configuring alert rules
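
Read against the endpoint list in section 4, the two permissions line up with HTTP methods: every `GET` endpoint is a view, and every `POST`, `PUT`, or `DELETE` endpoint acknowledges, resolves, or configures. A minimal sketch of a check built on that reading; the function names are hypothetical, and whether `monitoring:write` implies `monitoring:read` is not stated, so it is not assumed here.

```python
READ = "monitoring:read"
WRITE = "monitoring:write"


def required_permission(method: str) -> str:
    """Return the permission assumed to guard an endpoint with this HTTP method."""
    return READ if method.upper() == "GET" else WRITE


def is_allowed(method: str, granted: set[str]) -> bool:
    """Check whether the caller's granted permissions cover the request."""
    return required_permission(method) in granted


viewer = {READ}
print(is_allowed("GET", viewer), is_allowed("POST", viewer))
```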
## 6. Error Handling
The module shall handle the following error conditions:

- Metrics collection failures
- Alert rule evaluation errors
- Log access errors
- Insufficient permissions