# Calypso Appliance Health Check Script

## Overview
Comprehensive health check script for all Calypso Appliance components. Performs automated checks across system resources, services, network, storage, and backup infrastructure.

## Installation
Script location: `/usr/local/bin/calypso-healthcheck`

## Usage

### Basic Usage
```bash
# Run health check (requires root)
calypso-healthcheck

# Run and save to specific location
calypso-healthcheck 2>&1 | tee /root/healthcheck-$(date +%Y%m%d).log
```

### Exit Codes
- `0` - All checks passed (100% healthy)
- `1` - Healthy with warnings (some non-critical issues)
- `2` - Degraded (80%+ checks passed, some failures)
- `3` - Critical (less than 80% checks passed)

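
The mapping above can be sketched as a small function. This is a minimal sketch, not the script's actual code: the name `health_exit_code` and its argument order are hypothetical, and it assumes the percentage is the integer share of passed checks out of passed + warnings + failed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the documented exit code from check counters.
# Usage: health_exit_code PASSED WARNINGS FAILED
health_exit_code() {
    local passed=$1 warnings=$2 failed=$3
    local total=$((passed + warnings + failed))

    # Assumption: a run with no checks at all is treated as critical.
    if [ "$total" -eq 0 ]; then
        echo 3
        return
    fi

    # Integer percentage of passed checks (35 of 39 gives 89).
    local percent=$((passed * 100 / total))

    if [ "$failed" -eq 0 ] && [ "$warnings" -eq 0 ]; then
        echo 0      # all checks passed
    elif [ "$failed" -eq 0 ]; then
        echo 1      # healthy with warnings
    elif [ "$percent" -ge 80 ]; then
        echo 2      # degraded
    else
        echo 3      # critical
    fi
}

health_exit_code 35 0 4   # 35 passed, 0 warnings, 4 failed
```

With 35 of 39 checks passed and 4 failed, the sketch prints `2` (degraded).
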
### Automated Checks

#### System Resources (4 checks)
- Root filesystem usage (threshold: 80%)
- /var filesystem usage (threshold: 80%)
- Memory usage (threshold: 90%)
- CPU load average

#### Database Services (2 checks)
- PostgreSQL service status
- Database presence (calypso, bacula)

#### Calypso Application (7 checks)
- calypso-api service
- calypso-frontend service
- calypso-logger service
- API port 8443
- Frontend port 3000
- API health endpoint
- Frontend health endpoint

#### Backup Services - Bacula (8 checks)
- bacula-director service
- bacula-fd service
- bacula-sd service
- Director bconsole connectivity
- Storage (Scalar-i500) accessibility
- Director port 9101
- FD port 9102
- SD port 9103

#### Virtual Tape Library - mhVTL (5 checks)
- mhvtl.target status
- vtllibrary@10 (Scalar i500)
- vtllibrary@30 (Scalar i40)
- VTL device count (2 changers, 8 tape drives)
- Scalar i500 slots detection

#### Storage Protocols (9 checks)
- NFS server service
- Samba (smbd) service
- NetBIOS (nmbd) service
- SCST service
- iSCSI target service
- NFS port 2049
- SMB port 445
- NetBIOS port 139
- iSCSI port 3260

#### Monitoring & Management (2 checks)
- SNMP daemon
- SNMP port 161

#### Network Connectivity (2 checks)
- Internet connectivity (ping 8.8.8.8)
- Network manager status

**Total: 39+ automated checks**

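
Many of the checks above verify that a port is listening. One way to do that can be sketched as follows. This is a minimal sketch that assumes `ss -tuln`-style output on stdin; `port_is_listening` is a hypothetical name, and the real script may check ports differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if PORT appears as a local port in
# `ss -tuln`-style output read from stdin.
# Usage: ss -tuln | port_is_listening 8443
port_is_listening() {
    local port=$1
    # Column 5 of `ss -tuln` is "Local Address:Port". Match the port at
    # the very end of that field so that 445 does not match 4450.
    awk -v port="$port" '
        $5 ~ (":" port "$") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

Reading the listing from stdin keeps the function independent of how the socket list is produced, which also makes it easy to try against saved output.
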
## Output Format

### Console Output
- Color-coded status indicators:
  - ✓ Green = Passed
  - ⚠ Yellow = Warning
  - ✗ Red = Failed

### Example Output
```
==========================================
CALYPSO APPLIANCE HEALTH CHECK
==========================================
Date: 2025-12-31 01:46:27
Hostname: calypso
Uptime: up 6 days, 2 hours, 50 minutes
Log file: /var/log/calypso-healthcheck-20251231-014627.log

========================================
SYSTEM RESOURCES
========================================
✓ Root filesystem (18% used)
✓ Var filesystem (18% used)
✓ Memory usage (49% used, 8206MB available)
✓ CPU load average (2.18, 8 cores)

...

========================================
HEALTH CHECK SUMMARY
========================================

Total Checks: 39
Passed: 35
Warnings: 0
Failed: 4

⚠ OVERALL STATUS: DEGRADED (89%)
```

### Log Files
All checks are logged to: `/var/log/calypso-healthcheck-YYYYMMDD-HHMMSS.log`

Logs include:
- Timestamp and system information
- Detailed check results
- Summary statistics
- Overall health status

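
Because each log records the overall health status, a saved log can be queried after the fact. This is a minimal sketch that assumes the log contains the same `OVERALL STATUS:` line as the console output; `overall_status` is a hypothetical helper, not part of the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the status word(s) from the
# "OVERALL STATUS: ..." line of a health check log read from stdin.
# Usage: overall_status < /var/log/calypso-healthcheck-YYYYMMDD-HHMMSS.log
overall_status() {
    # Keep only the uppercase words after the label, then trim the
    # trailing space left before the "(NN%)" suffix.
    sed -n '/OVERALL STATUS: /{s/.*OVERALL STATUS: \([A-Z ]*\).*/\1/;s/ *$//;p;}'
}
```

The function prints nothing when the log has no summary line, so callers can treat empty output as an incomplete run.
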
## Scheduling

### Manual Execution
```bash
# Run on demand
sudo calypso-healthcheck
```

### Cron Job (Recommended)
Add to crontab for automated checks:

```bash
# Daily health check at 2 AM
0 2 * * * /usr/local/bin/calypso-healthcheck > /dev/null 2>&1

# Weekly health check on Monday at 6 AM with email notification
0 6 * * 1 /usr/local/bin/calypso-healthcheck 2>&1 | mail -s "Calypso Health Check" admin@example.com
```

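
The weekly cron entry mails every report. If mail is wanted only when the exit code signals degraded (`2`) or critical (`3`), a wrapper along these lines might be used. This is a minimal sketch: `run_and_alert` and `notify_admin` are hypothetical names, and it assumes a working local `mail` command.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a health check command and call a notifier
# only when the exit code signals degraded (2) or critical (3).
# Usage: run_and_alert NOTIFY_CMD CHECK_CMD [ARGS...]
run_and_alert() {
    local notify=$1
    shift
    local output rc
    output=$("$@" 2>&1)
    rc=$?                      # capture before any other command runs
    if [ "$rc" -ge 2 ]; then
        # The notifier gets the report on stdin and the exit code as $1.
        printf '%s\n' "$output" | "$notify" "$rc"
    fi
    return "$rc"
}

# Example notifier (assumption: `mail` is installed and configured).
notify_admin() {
    mail -s "Calypso Health Check: exit code $1" admin@example.com
}

# Intended call, e.g. from a small script started by cron:
#   run_and_alert notify_admin /usr/local/bin/calypso-healthcheck
```

Passing the notifier as an argument keeps the alerting channel swappable without touching the wrapper.
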
### Systemd Timer (Alternative)
Create `/etc/systemd/system/calypso-healthcheck.timer`:
```ini
[Unit]
Description=Daily Calypso Health Check
Requires=calypso-healthcheck.service

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Create `/etc/systemd/system/calypso-healthcheck.service`:
```ini
[Unit]
Description=Calypso Appliance Health Check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/calypso-healthcheck
```

Enable:
```bash
systemctl enable --now calypso-healthcheck.timer
```

## Troubleshooting

### Common Failures

#### API/Frontend Health Endpoints Failing
```bash
# Check if services are running
systemctl status calypso-api calypso-frontend

# Check service logs
journalctl -u calypso-api -n 50
journalctl -u calypso-frontend -n 50

# Test manually
curl -k https://localhost:8443/health
curl -k https://localhost:3000/health
```

#### Bacula Director Not Responding
```bash
# Check service
systemctl status bacula-director

# Test bconsole
echo "status director" | bconsole

# Check logs
tail -50 /var/log/bacula/bacula.log
```

#### VTL Slots Not Detected
```bash
# Check VTL services
systemctl status mhvtl.target

# Check devices
lsscsi | grep -E "mediumx|tape"

# Test manually
mtx -f /dev/sg7 status
echo "update slots storage=Scalar-i500" | bconsole
```

#### Storage Protocols Port Not Listening
```bash
# Check service status
systemctl status nfs-server smbd nmbd scst iscsi-scstd

# Check listening ports
ss -tuln | grep -E "2049|445|139|3260"

# Restart services if needed
systemctl restart nfs-server
systemctl restart smbd nmbd
```

## Customization

### Modify Thresholds
Edit `/usr/local/bin/calypso-healthcheck`:

```bash
# Disk usage threshold (default: 80%)
check_disk "/" 80 "Root filesystem"

# Memory usage threshold (default: 90%)
if [ "$mem_percent" -lt 90 ]; then

# Change expected VTL devices
if [ "$changer_count" -ge 2 ] && [ "$tape_count" -ge 8 ]; then
```

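
For orientation, a threshold check such as `check_disk` could be built roughly like this. This is a minimal sketch: the body of the real function is not shown in this document, so `disk_usage_percent` and `usage_within_threshold` are hypothetical names, and treating usage at or above the threshold as a failure is an assumption modeled on the `-lt` comparison of the memory check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the used percentage of the filesystem
# holding PATH, without the trailing "%".
disk_usage_percent() {
    # -P forces POSIX output: one line per filesystem, capacity in column 5.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed while USED stays below THRESHOLD.
usage_within_threshold() {
    local used=$1 threshold=$2
    [ "$used" -lt "$threshold" ]
}

used=$(disk_usage_percent /)
if usage_within_threshold "$used" 80; then
    echo "Root filesystem OK (${used}% used)"
else
    echo "Root filesystem above threshold (${used}% used)"
fi
```

Separating the measurement from the comparison means a threshold change touches only the call site, which matches the way the excerpts above pass `80` and `90` as literals.
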
### Add Custom Checks
Add new check functions:

```bash
check_custom() {
    TOTAL_CHECKS=$((TOTAL_CHECKS + 1))

    # Replace `condition` with the actual test for this check
    if [[ condition ]]; then
        echo -e "${GREEN}${CHECK}${NC} Custom check passed" | tee -a "$LOG_FILE"
        PASSED_CHECKS=$((PASSED_CHECKS + 1))
    else
        echo -e "${RED}${CROSS}${NC} Custom check failed" | tee -a "$LOG_FILE"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}

# Call in main script
check_custom
```

## Integration

### Monitoring Systems
Export metrics for monitoring:

```bash
# Nagios/Icinga format
calypso-healthcheck
rc=$?   # capture once; each later test overwrites $?
if [ "$rc" -eq 0 ]; then
    echo "OK - All checks passed"
    exit 0
elif [ "$rc" -eq 1 ]; then
    echo "WARNING - Healthy with warnings"
    exit 1
else
    # Exit codes 2 (degraded) and 3 (critical) both map to CRITICAL
    echo "CRITICAL - System degraded"
    exit 2
fi
```

### API Integration
Parse JSON output (this assumes a `--json` option has been added to the script):

```bash
# Requires a JSON output option to be added to the script
calypso-healthcheck --json > /tmp/health.json
```

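
If a JSON output option is added, the summary counters could be emitted along these lines. This is a minimal sketch: `summary_json` and the field names are hypothetical, not an existing interface of the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the summary counters as one JSON object.
# Usage: summary_json TOTAL PASSED WARNINGS FAILED STATUS
# STATUS must not contain double quotes or backslashes, because it is
# inserted without JSON escaping.
summary_json() {
    printf '{"total":%d,"passed":%d,"warnings":%d,"failed":%d,"status":"%s"}\n' \
        "$1" "$2" "$3" "$4" "$5"
}

summary_json 39 35 0 4 DEGRADED
```

For these counters the sketch prints `{"total":39,"passed":35,"warnings":0,"failed":4,"status":"DEGRADED"}`.
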
## Maintenance

### Log Rotation
Logs are stored in `/var/log/calypso-healthcheck-*.log`

Create `/etc/logrotate.d/calypso-healthcheck`:
```
/var/log/calypso-healthcheck-*.log {
    weekly
    rotate 12
    compress
    delaycompress
    missingok
    notifempty
}
```

### Cleanup Old Logs
```bash
# Remove logs older than 30 days
find /var/log -name "calypso-healthcheck-*.log" -mtime +30 -delete
```

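
Before deleting, it can help to list what would be removed. This is a minimal sketch of a dry run using the same `find` criteria with `-print` instead of `-delete`; `list_old_healthcheck_logs` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list health check logs older than DAYS days
# under DIR without deleting anything.
# Usage: list_old_healthcheck_logs DIR DAYS
list_old_healthcheck_logs() {
    local dir=$1 days=$2
    find "$dir" -name "calypso-healthcheck-*.log" -mtime +"$days" -print
}

# Example: list_old_healthcheck_logs /var/log 30
```
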
## Best Practices

1. **Run after reboot** - Verify all services started correctly
2. **Schedule regular checks** - Daily or weekly automated runs
3. **Monitor exit codes** - Alert on degraded/critical status
4. **Review logs periodically** - Identify patterns or recurring issues
5. **Update checks** - Add new components as system evolves
6. **Baseline health** - Establish normal operating parameters
7. **Document exceptions** - Note known warnings that are acceptable


## See Also
- `pre-reboot-checklist.md` - Pre-reboot verification
- `bacula-vtl-troubleshooting.md` - VTL troubleshooting guide
- System logs: `/var/log/syslog`, `/var/log/bacula/`

---

*Created: 2025-12-31*
*Script: `/usr/local/bin/calypso-healthcheck`*