still fixing i40 vtl issue
docs/healthcheck-script.md
Normal file
@@ -0,0 +1,344 @@
# Calypso Appliance Health Check Script

## Overview
Comprehensive health check script for all Calypso Appliance components. Performs automated checks across system resources, services, network, storage, and backup infrastructure.

## Installation
Script location: `/usr/local/bin/calypso-healthcheck`

## Usage

### Basic Usage
```bash
# Run health check (requires root)
calypso-healthcheck

# Run and save to specific location
calypso-healthcheck 2>&1 | tee /root/healthcheck-$(date +%Y%m%d).log
```

### Exit Codes
- `0` - All checks passed (100% healthy)
- `1` - Healthy with warnings (some non-critical issues)
- `2` - Degraded (80%+ checks passed, some failures)
- `3` - Critical (less than 80% checks passed)
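
The mapping from check counters to exit codes can be sketched as a small shell function. This is a minimal sketch, not the script's actual logic: the name `health_exit_code` and its argument order are hypothetical, and it assumes the percentage is passed/total rounded down and that warnings only matter when nothing failed.

```bash
# Hypothetical helper: map check counters to the documented exit codes.
# Arguments: total passed warnings failed
health_exit_code() {
    local total=$1 passed=$2 warnings=$3 failed=$4
    local percent=0
    # Integer percentage, rounded down; guard against division by zero.
    if [ "$total" -gt 0 ]; then
        percent=$(( passed * 100 / total ))
    fi

    if [ "$failed" -eq 0 ] && [ "$warnings" -eq 0 ]; then
        echo 0      # all checks passed
    elif [ "$failed" -eq 0 ]; then
        echo 1      # healthy with warnings
    elif [ "$percent" -ge 80 ]; then
        echo 2      # degraded
    else
        echo 3      # critical
    fi
}

health_exit_code 39 35 0 4   # prints 2 (35 of 39 passed, with failures)
```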

### Automated Checks

#### System Resources (4 checks)
- Root filesystem usage (threshold: 80%)
- /var filesystem usage (threshold: 80%)
- Memory usage (threshold: 90%)
- CPU load average

#### Database Services (2 checks)
- PostgreSQL service status
- Database presence (calypso, bacula)

#### Calypso Application (7 checks)
- calypso-api service
- calypso-frontend service
- calypso-logger service
- API port 8443
- Frontend port 3000
- API health endpoint
- Frontend health endpoint

#### Backup Services - Bacula (8 checks)
- bacula-director service
- bacula-fd service
- bacula-sd service
- Director bconsole connectivity
- Storage (Scalar-i500) accessibility
- Director port 9101
- FD port 9102
- SD port 9103

#### Virtual Tape Library - mhVTL (5 checks)
- mhvtl.target status
- vtllibrary@10 (Scalar i500)
- vtllibrary@30 (Scalar i40)
- VTL device count (2 changers, 8 tape drives)
- Scalar i500 slots detection

#### Storage Protocols (9 checks)
- NFS server service
- Samba (smbd) service
- NetBIOS (nmbd) service
- SCST service
- iSCSI target service
- NFS port 2049
- SMB port 445
- NetBIOS port 139
- iSCSI port 3260

#### Monitoring & Management (2 checks)
- SNMP daemon
- SNMP port 161

#### Network Connectivity (2 checks)
- Internet connectivity (ping 8.8.8.8)
- Network manager status

**Total: 39+ automated checks**

## Output Format

### Console Output
- Color-coded status indicators:
  - ✓ Green = Passed
  - ⚠ Yellow = Warning
  - ✗ Red = Failed

### Example Output
```
==========================================
CALYPSO APPLIANCE HEALTH CHECK
==========================================
Date: 2025-12-31 01:46:27
Hostname: calypso
Uptime: up 6 days, 2 hours, 50 minutes
Log file: /var/log/calypso-healthcheck-20251231-014627.log

========================================
SYSTEM RESOURCES
========================================
✓ Root filesystem (18% used)
✓ Var filesystem (18% used)
✓ Memory usage (49% used, 8206MB available)
✓ CPU load average (2.18, 8 cores)

...

========================================
HEALTH CHECK SUMMARY
========================================

Total Checks: 39
Passed: 35
Warnings: 0
Failed: 4

⚠ OVERALL STATUS: DEGRADED (89%)
```
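
The percentage in the summary can be recomputed from a saved report. The following is a minimal sketch, assuming the report is plain text containing the `Total Checks:` and `Passed:` lines shown in the example and that the percentage is rounded down; `summary_percent` is a hypothetical helper, not part of the script.

```bash
# Hypothetical helper: compute the pass percentage from a report on stdin.
# Relies only on the "Total Checks:" and "Passed:" summary lines.
summary_percent() {
    awk -F': *' '
        $1 ~ /Total Checks$/ { total = $2 }
        $1 ~ /Passed$/       { passed = $2 }
        END {
            if (total > 0) printf "%d\n", int(passed * 100 / total)
            else exit 1   # no usable summary found
        }'
}

printf 'Total Checks: 39\nPassed: 35\n' | summary_percent   # prints 89
```

On the appliance the input would be one of the log files, for example `summary_percent < /var/log/calypso-healthcheck-20251231-014627.log`. If the log contains color escape sequences, they would need to be stripped first.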

### Log Files
All checks are logged to: `/var/log/calypso-healthcheck-YYYYMMDD-HHMMSS.log`

Logs include:
- Timestamp and system information
- Detailed check results
- Summary statistics
- Overall health status
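
Because the timestamp is embedded in the file name, the most recent log can be found without looking at modification times. A minimal sketch, assuming the naming pattern above; `latest_healthcheck_log` is a hypothetical helper name.

```bash
# Hypothetical helper: print the newest health check log in a directory.
# The YYYYMMDD-HHMMSS stamp sorts chronologically, and the shell expands
# globs in sorted order, so the last match is the newest log.
latest_healthcheck_log() {
    local dir=${1:-/var/log} latest="" f
    for f in "$dir"/calypso-healthcheck-*.log; do
        [ -e "$f" ] || continue      # the glob matched nothing
        latest=$f
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

latest_healthcheck_log /var/log || echo "no health check logs found"
```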

## Scheduling

### Manual Execution
```bash
# Run on demand
sudo calypso-healthcheck
```

### Cron Job (Recommended)
Add to crontab for automated checks:

```bash
# Daily health check at 2 AM
0 2 * * * /usr/local/bin/calypso-healthcheck > /dev/null 2>&1

# Weekly health check on Monday at 6 AM with email notification
0 6 * * 1 /usr/local/bin/calypso-healthcheck 2>&1 | mail -s "Calypso Health Check" admin@example.com
```

### Systemd Timer (Alternative)
Create `/etc/systemd/system/calypso-healthcheck.timer`:
```ini
[Unit]
Description=Daily Calypso Health Check
Requires=calypso-healthcheck.service

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Create `/etc/systemd/system/calypso-healthcheck.service`:
```ini
[Unit]
Description=Calypso Appliance Health Check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/calypso-healthcheck
```

Reload systemd so it picks up the new unit files, then enable the timer:
```bash
systemctl daemon-reload
systemctl enable --now calypso-healthcheck.timer
```

## Troubleshooting

### Common Failures

#### API/Frontend Health Endpoints Failing
```bash
# Check if services are running
systemctl status calypso-api calypso-frontend

# Check service logs
journalctl -u calypso-api -n 50
journalctl -u calypso-frontend -n 50

# Test manually
curl -k https://localhost:8443/health
curl -k https://localhost:3000/health
```

#### Bacula Director Not Responding
```bash
# Check service
systemctl status bacula-director

# Test bconsole
echo "status director" | bconsole

# Check logs
tail -50 /var/log/bacula/bacula.log
```

#### VTL Slots Not Detected
```bash
# Check VTL services
systemctl status mhvtl.target

# Check devices
lsscsi | grep -E "mediumx|tape"

# Test manually
mtx -f /dev/sg7 status
echo "update slots storage=Scalar-i500" | bconsole
```
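
To compare the `lsscsi` listing against the expected device count (2 changers, 8 tape drives), the types can be counted rather than eyeballed. A minimal sketch, assuming the device type is the second column of `lsscsi` output; `count_scsi_type` and `vtl_devices_ok` are hypothetical helper names, not functions from the script.

```bash
# Hypothetical helper: count devices of one type in `lsscsi` output (stdin).
# The type ("mediumx", "tape", "disk", ...) is the second column.
count_scsi_type() {
    awk -v type="$1" '$2 == type { n++ } END { print n + 0 }'
}

# Succeed only when at least 2 changers and 8 tape drives are listed.
vtl_devices_ok() {
    local listing changers tapes
    listing=$(cat)
    changers=$(printf '%s\n' "$listing" | count_scsi_type mediumx)
    tapes=$(printf '%s\n' "$listing" | count_scsi_type tape)
    [ "$changers" -ge 2 ] && [ "$tapes" -ge 8 ]
}

# On the appliance:
#   lsscsi | vtl_devices_ok && echo "VTL devices present"
```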

#### Storage Protocol Ports Not Listening
```bash
# Check service status
systemctl status nfs-server smbd nmbd scst iscsi-scstd

# Check listening ports
ss -tuln | grep -E "2049|445|139|3260"

# Restart services if needed
systemctl restart nfs-server
systemctl restart smbd nmbd
```
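
The `grep` pattern above also matches longer port numbers that merely contain those digits (for example 14450). A stricter test can match the port at the end of the local address column. This is a minimal sketch, assuming the usual `ss -tuln` layout where column 5 is `Local Address:Port`; `port_listening` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if `ss -tuln` output (stdin) shows a socket
# bound to exactly the given port. Column 5 is "Local Address:Port".
port_listening() {
    awk -v port="$1" '$5 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# On the appliance:
#   ss -tuln | port_listening 445 && echo "SMB port is listening"
```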

## Customization

### Modify Thresholds
Edit `/usr/local/bin/calypso-healthcheck`:

```bash
# Disk usage threshold (default: 80%)
check_disk "/" 80 "Root filesystem"

# Memory usage threshold (default: 90%)
if [ "$mem_percent" -lt 90 ]; then

# Change expected VTL devices
if [ "$changer_count" -ge 2 ] && [ "$tape_count" -ge 8 ]; then
```
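
The body of `check_disk` is not shown here, so the following is only a sketch of how a disk threshold test of this kind might work. The helper names `df_usage_percent` and `disk_status` are hypothetical, and treating usage equal to the threshold as a failure is an assumption that mirrors the `-lt` comparison used for memory.

```bash
# Extract the use percentage (column 5, "%" stripped) from `df -P` output
# read on stdin. Hypothetical helper, not taken from the script.
df_usage_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print PASS while usage is below the threshold, FAIL otherwise.
disk_status() {
    local usage=$1 threshold=$2
    if [ "$usage" -lt "$threshold" ]; then
        echo PASS
    else
        echo FAIL
    fi
}

usage=$(df -P / | df_usage_percent)
echo "Root filesystem: ${usage}% used ($(disk_status "$usage" 80))"
```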

### Add Custom Checks
Add new check functions:

```bash
check_custom() {
    TOTAL_CHECKS=$((TOTAL_CHECKS + 1))

    # Replace "condition" with the actual test for this check
    if [[ condition ]]; then
        echo -e "${GREEN}${CHECK}${NC} Custom check passed" | tee -a "$LOG_FILE"
        PASSED_CHECKS=$((PASSED_CHECKS + 1))
    else
        echo -e "${RED}${CROSS}${NC} Custom check failed" | tee -a "$LOG_FILE"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}

# Call in main script
check_custom
```

## Integration

### Monitoring Systems
Map the script's exit code to a Nagios/Icinga plugin status:

```bash
# Nagios/Icinga format
calypso-healthcheck
rc=$?   # save the exit code; each [ ] test below overwrites $?
if [ "$rc" -eq 0 ]; then
    echo "OK - All checks passed"
    exit 0
elif [ "$rc" -eq 1 ]; then
    echo "WARNING - Healthy with warnings"
    exit 1
else
    echo "CRITICAL - System degraded"
    exit 2
fi
```

### API Integration
Parse JSON output:

```bash
# Add JSON output option
calypso-healthcheck --json > /tmp/health.json
```
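
What `--json` emits is not shown here. As a minimal sketch of what a JSON summary could look like, the function below prints the four summary counters and the pass percentage as one object. The name `summary_json` and every field name are assumptions for illustration, not the script's actual output format.

```bash
# Hypothetical helper: print the summary counters as one JSON object.
# Arguments: total passed warnings failed
summary_json() {
    local total=$1 passed=$2 warnings=$3 failed=$4 percent=0
    # Integer percentage, rounded down; guard against division by zero.
    if [ "$total" -gt 0 ]; then
        percent=$(( passed * 100 / total ))
    fi
    printf '{"total":%d,"passed":%d,"warnings":%d,"failed":%d,"percent":%d}\n' \
        "$total" "$passed" "$warnings" "$failed" "$percent"
}

summary_json 39 35 0 4
# prints {"total":39,"passed":35,"warnings":0,"failed":4,"percent":89}
```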

## Maintenance

### Log Rotation
Logs are stored in `/var/log/calypso-healthcheck-*.log`

Create `/etc/logrotate.d/calypso-healthcheck`:
```
/var/log/calypso-healthcheck-*.log {
    weekly
    rotate 12
    compress
    delaycompress
    missingok
    notifempty
}
```

### Cleanup Old Logs
```bash
# Remove logs older than 30 days
find /var/log -name "calypso-healthcheck-*.log" -mtime +30 -delete
```
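
Since `-delete` is irreversible, it can help to preview the matches first. A minimal sketch, assuming GNU or BSD `find`; `prune_healthcheck_logs` is a hypothetical wrapper that lists by default and deletes only when asked. Unlike the command above, it does not descend into subdirectories.

```bash
# Hypothetical helper: list (default) or delete logs older than N days.
# Usage: prune_healthcheck_logs DIR DAYS [--delete]
prune_healthcheck_logs() {
    local dir=$1 days=$2 action=-print
    [ "${3:-}" = "--delete" ] && action=-delete
    find "$dir" -maxdepth 1 -type f -name 'calypso-healthcheck-*.log' \
        -mtime +"$days" "$action"
}

# Preview first, then delete:
#   prune_healthcheck_logs /var/log 30
#   prune_healthcheck_logs /var/log 30 --delete
```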

## Best Practices

1. **Run after reboot** - Verify all services started correctly
2. **Schedule regular checks** - Daily or weekly automated runs
3. **Monitor exit codes** - Alert on degraded/critical status
4. **Review logs periodically** - Identify patterns or recurring issues
5. **Update checks** - Add new components as system evolves
6. **Baseline health** - Establish normal operating parameters
7. **Document exceptions** - Note known warnings that are acceptable

## See Also
- `pre-reboot-checklist.md` - Pre-reboot verification
- `bacula-vtl-troubleshooting.md` - VTL troubleshooting guide
- System logs: `/var/log/syslog`, `/var/log/bacula/`

---

*Created: 2025-12-31*
*Script: `/usr/local/bin/calypso-healthcheck`*