Calypso Appliance Health Check Script

Overview

calypso-healthcheck is a health check script that covers all Calypso Appliance components. It runs automated checks across system resources, services, network, storage, and backup infrastructure, then reports a summary and an overall status.

Installation

Script location: /usr/local/bin/calypso-healthcheck

Usage

Basic Usage

# Run health check (requires root)
calypso-healthcheck

# Run and save to specific location
calypso-healthcheck 2>&1 | tee /root/healthcheck-$(date +%Y%m%d).log

Exit Codes

  • 0 - All checks passed (100% healthy)
  • 1 - Healthy with warnings (some non-critical issues)
  • 2 - Degraded (80%+ checks passed, some failures)
  • 3 - Critical (less than 80% checks passed)
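
One way the check counters could map onto these exit codes is sketched below. It assumes the percentage is passed checks over total checks with integer division, and that any failed check rules out codes 0 and 1. The function name health_exit_code is hypothetical and is not taken from the script.

```shell
# Hypothetical helper: derive the documented exit code from check counters.
# Usage: health_exit_code TOTAL PASSED WARNINGS   (prints 0, 1, 2, or 3)
health_exit_code() {
    total=$1
    passed=$2
    warnings=$3
    failed=$((total - passed - warnings))

    if [ "$total" -le 0 ]; then
        echo 3      # assumption: no checks ran, so report critical
        return
    fi

    percent=$((passed * 100 / total))   # integer division, e.g. 35 of 39 gives 89

    if [ "$failed" -eq 0 ] && [ "$warnings" -eq 0 ]; then
        echo 0      # all checks passed
    elif [ "$failed" -eq 0 ]; then
        echo 1      # warnings only
    elif [ "$percent" -ge 80 ]; then
        echo 2      # degraded
    else
        echo 3      # critical
    fi
}
```

For example, `health_exit_code 39 35 0` prints 2, which corresponds to the Degraded status.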

Automated Checks

System Resources (4 checks)

  • Root filesystem usage (threshold: 80%)
  • /var filesystem usage (threshold: 80%)
  • Memory usage (threshold: 90%)
  • CPU load average
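
A filesystem usage check of this kind can be sketched with df. This is an illustration only and may differ from how the script computes the value; the function names disk_usage_percent and disk_below_threshold are hypothetical.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given path. df -P gives a stable one-line-per-filesystem
# layout whose 5th column is the capacity, e.g. "18%".
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical threshold test: succeeds while usage is below the limit.
# Usage: disk_below_threshold PATH LIMIT_PERCENT
disk_below_threshold() {
    usage=$(disk_usage_percent "$1")
    [ -n "$usage" ] || return 2     # df produced no usable output
    [ "$usage" -lt "$2" ]
}
```

With the 80% threshold listed above, `disk_below_threshold / 80` succeeds while the root filesystem is under 80% used.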

Database Services (2 checks)

  • PostgreSQL service status
  • Database presence (calypso, bacula)

Calypso Application (7 checks)

  • calypso-api service
  • calypso-frontend service
  • calypso-logger service
  • API port 8443
  • Frontend port 3000
  • API health endpoint
  • Frontend health endpoint

Backup Services - Bacula (8 checks)

  • bacula-director service
  • bacula-fd service
  • bacula-sd service
  • Director bconsole connectivity
  • Storage (Scalar-i500) accessibility
  • Director port 9101
  • FD port 9102
  • SD port 9103

Virtual Tape Library - mhVTL (5 checks)

  • mhvtl.target status
  • vtllibrary@10 (Scalar i500)
  • vtllibrary@30 (Scalar i40)
  • VTL device count (2 changers, 8 tape drives)
  • Scalar i500 slots detection
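
The device count can be illustrated by counting device types in lsscsi output, where the second column is the type (mediumx for changers, tape for drives). This is a sketch under that assumption; count_scsi_type is a hypothetical name, not a function from the script.

```shell
# Count SCSI devices of one type in lsscsi output read from stdin.
# The device type is the second whitespace-separated field of each line.
# Usage: lsscsi | count_scsi_type tape
count_scsi_type() {
    awk -v want="$1" '$2 == want { n++ } END { print n + 0 }'
}
```

On an appliance matching the expected layout, `lsscsi | count_scsi_type mediumx` would print 2 and `lsscsi | count_scsi_type tape` would print 8.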

Storage Protocols (9 checks)

  • NFS server service
  • Samba (smbd) service
  • NetBIOS (nmbd) service
  • SCST service
  • iSCSI target service
  • NFS port 2049
  • SMB port 445
  • NetBIOS port 139
  • iSCSI port 3260

Monitoring & Management (2 checks)

  • SNMP daemon
  • SNMP port 161

Network Connectivity (2 checks)

  • Internet connectivity (ping 8.8.8.8)
  • Network manager status

Total: 39+ automated checks

Output Format

Console Output

  • Color-coded status indicators:
    • ✓ Green = Passed
    • ⚠ Yellow = Warning
    • ✗ Red = Failed

Example Output

==========================================
  CALYPSO APPLIANCE HEALTH CHECK
==========================================
Date: 2025-12-31 01:46:27
Hostname: calypso
Uptime: up 6 days, 2 hours, 50 minutes
Log file: /var/log/calypso-healthcheck-20251231-014627.log

========================================
SYSTEM RESOURCES
========================================
✓ Root filesystem (18% used)
✓ Var filesystem (18% used)
✓ Memory usage (49% used, 8206MB available)
✓ CPU load average (2.18, 8 cores)

...

========================================
HEALTH CHECK SUMMARY
========================================

Total Checks:    39
Passed:          35
Warnings:        0
Failed:          4

⚠ OVERALL STATUS: DEGRADED (89%)

Log Files

All checks are logged to: /var/log/calypso-healthcheck-YYYYMMDD-HHMMSS.log

Logs include:

  • Timestamp and system information
  • Detailed check results
  • Summary statistics
  • Overall health status
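
The summary statistics can be pulled back out of a log file with a short awk filter. The sketch below assumes the summary lines appear in the log exactly as in the example output ("Failed:          4", with no color codes on those lines); summary_field is a hypothetical helper name.

```shell
# Print the numeric value of one summary field from a health check log.
# Usage: summary_field "Failed" /path/to/logfile
summary_field() {
    awk -F: -v name="$1" '
        $1 == name {
            gsub(/[[:space:]]/, "", $2)   # strip the padding before the number
            print $2
            exit
        }' "$2"
}
```

For example, `summary_field "Total Checks" "$(ls -t /var/log/calypso-healthcheck-*.log | head -n 1)"` reads the total from the most recent log.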

Scheduling

Manual Execution

# Run on demand
sudo calypso-healthcheck

Cron

Add to crontab for automated checks:

# Daily health check at 2 AM
0 2 * * * /usr/local/bin/calypso-healthcheck > /dev/null 2>&1

# Weekly health check on Monday at 6 AM with email notification
0 6 * * 1 /usr/local/bin/calypso-healthcheck 2>&1 | mail -s "Calypso Health Check" admin@example.com

Systemd Timer (Alternative)

Create /etc/systemd/system/calypso-healthcheck.timer:

[Unit]
Description=Daily Calypso Health Check
Requires=calypso-healthcheck.service

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Create /etc/systemd/system/calypso-healthcheck.service:

[Unit]
Description=Calypso Appliance Health Check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/calypso-healthcheck

Reload systemd so it picks up the new unit files, then enable the timer:

systemctl daemon-reload
systemctl enable --now calypso-healthcheck.timer

Troubleshooting

Common Failures

API/Frontend Health Endpoints Failing

# Check if services are running
systemctl status calypso-api calypso-frontend

# Check service logs
journalctl -u calypso-api -n 50
journalctl -u calypso-frontend -n 50

# Test manually
curl -k https://localhost:8443/health
curl -k https://localhost:3000/health

Bacula Director Not Responding

# Check service
systemctl status bacula-director

# Test bconsole
echo "status director" | bconsole

# Check logs
tail -50 /var/log/bacula/bacula.log

VTL Slots Not Detected

# Check VTL services
systemctl status mhvtl.target

# Check devices
lsscsi | grep -E "mediumx|tape"

# Test manually
mtx -f /dev/sg7 status
echo "update slots storage=Scalar-i500" | bconsole

Storage Protocols Port Not Listening

# Check service status
systemctl status nfs-server smbd nmbd scst iscsi-scstd

# Check listening ports
ss -tuln | grep -E "2049|445|139|3260"

# Restart services if needed
systemctl restart nfs-server
systemctl restart smbd nmbd
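
The grep pattern above matches the port digits anywhere in a line, so a socket on port 12049 would also match 2049. A stricter test is sketched below. It assumes the column layout of `ss -tuln`, where the fifth column is the local address and port; port_listening is a hypothetical helper name.

```shell
# Succeed if ss-style output on stdin shows a socket bound to exactly PORT.
# The port is whatever follows the last colon of the local address column,
# which handles both 0.0.0.0:445 and [::]:445.
# Usage: ss -tuln | port_listening 445
port_listening() {
    awk -v port="$1" '
        NR > 1 {                         # skip the header line
            n = split($5, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }'
}
```

For example, `ss -tuln | port_listening 3260 || echo "iSCSI port not listening"` reports only when nothing is bound to 3260.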

Customization

Modify Thresholds

Edit /usr/local/bin/calypso-healthcheck:

# Disk usage threshold (default: 80%)
check_disk "/" 80 "Root filesystem"

# Memory usage threshold (default: 90%)
if [ "$mem_percent" -lt 90 ]; then

# Change expected VTL devices
if [ "$changer_count" -ge 2 ] && [ "$tape_count" -ge 8 ]; then

Add Custom Checks

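Add new check functions, replacing the placeholder condition with a real test:

check_custom() {
    TOTAL_CHECKS=$((TOTAL_CHECKS + 1))

    # "condition" is a placeholder. As written, [[ condition ]] is always
    # true because it tests a non-empty string; substitute a real test.
    if [[ condition ]]; then
        echo -e "${GREEN}${CHECK}${NC} Custom check passed" | tee -a "$LOG_FILE"
        PASSED_CHECKS=$((PASSED_CHECKS + 1))
    else
        echo -e "${RED}${CROSS}${NC} Custom check failed" | tee -a "$LOG_FILE"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}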

# Call in main script
check_custom
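
When several custom checks are needed, the counter bookkeeping can be factored into one generic helper. The sketch below reuses the variable names from the template above; run_check itself is a hypothetical helper, and the defaults at the top only take effect if the main script has not already set those variables.

```shell
# Fall back to harmless defaults when run outside the main script.
: "${TOTAL_CHECKS:=0}" "${PASSED_CHECKS:=0}" "${FAILED_CHECKS:=0}"
: "${LOG_FILE:=/dev/null}"
: "${GREEN:=}" "${RED:=}" "${NC:=}" "${CHECK:=OK}" "${CROSS:=FAIL}"

# Hypothetical helper: run a command and record the result as one check.
# Usage: run_check "description" command [args...]
run_check() {
    desc=$1
    shift
    TOTAL_CHECKS=$((TOTAL_CHECKS + 1))

    # The command's own output is discarded; only its exit status matters.
    if "$@" > /dev/null 2>&1; then
        printf '%b\n' "${GREEN}${CHECK}${NC} ${desc}" | tee -a "$LOG_FILE"
        PASSED_CHECKS=$((PASSED_CHECKS + 1))
    else
        printf '%b\n' "${RED}${CROSS}${NC} ${desc}" | tee -a "$LOG_FILE"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}
```

A call such as `run_check "SNMP daemon" systemctl is-active --quiet snmpd` would then count as one check; the unit name snmpd is an assumption for illustration.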

Integration

Monitoring Systems

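Wrap the script so that its exit code maps to a monitoring plugin status:

# Nagios/Icinga format
# Discard the full report; the plugin prints its own one-line status
calypso-healthcheck > /dev/null 2>&1
rc=$?    # save the exit code now; every later test overwrites $?
if [ "$rc" -eq 0 ]; then
    echo "OK - All checks passed"
    exit 0
elif [ "$rc" -eq 1 ]; then
    echo "WARNING - Healthy with warnings"
    exit 1
else
    # Script exit codes 2 (degraded) and 3 (critical) both map to CRITICAL
    echo "CRITICAL - System degraded"
    exit 2
fi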

API Integration

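Parse JSON output. The --json option shown here is not described anywhere else in this document, so treat it as a feature that still has to be added to the script:

# Requires a JSON output option in the script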
calypso-healthcheck --json > /tmp/health.json
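
If such an option is added, the summary counters could be serialized with a single printf. This is a sketch only: the function name emit_summary_json and the JSON field names are assumptions, not an existing output format of the script.

```shell
# Hypothetical: print the summary counters as a one-line JSON object.
# Usage: emit_summary_json TOTAL PASSED WARNINGS FAILED
emit_summary_json() {
    # %d rejects non-numeric input, so only integers reach the output
    printf '{"total": %d, "passed": %d, "warnings": %d, "failed": %d}\n' \
        "$1" "$2" "$3" "$4"
}
```

Called with the counters from the example output, `emit_summary_json 39 35 0 4` prints `{"total": 39, "passed": 35, "warnings": 0, "failed": 4}`.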

Maintenance

Log Rotation

Logs are stored in /var/log/calypso-healthcheck-*.log

Create /etc/logrotate.d/calypso-healthcheck:

/var/log/calypso-healthcheck-*.log {
    weekly
    rotate 12
    compress
    delaycompress
    missingok
    notifempty
}

Cleanup Old Logs

# Remove logs older than 30 days
find /var/log -name "calypso-healthcheck-*.log" -mtime +30 -delete

Best Practices

  1. Run after reboot - Verify all services started correctly
  2. Schedule regular checks - Daily or weekly automated runs
  3. Monitor exit codes - Alert on degraded/critical status
  4. Review logs periodically - Identify patterns or recurring issues
  5. Update checks - Add new components as system evolves
  6. Baseline health - Establish normal operating parameters
  7. Document exceptions - Note known warnings that are acceptable

See Also

  • pre-reboot-checklist.md - Pre-reboot verification
  • bacula-vtl-troubleshooting.md - VTL troubleshooting guide
  • System logs: /var/log/syslog, /var/log/bacula/

Created: 2025-12-31
Script: /usr/local/bin/calypso-healthcheck