Warp Agent a558c97088 still fixing i40 vtl issue

2025-12-31 03:04:11 +07:00

Calypso Appliance Health Check Script

Overview

Comprehensive health check script for all Calypso Appliance components. Performs automated checks across system resources, services, network, storage, and backup infrastructure.

Installation

Script location: /usr/local/bin/calypso-healthcheck

Usage

Basic Usage

# Run health check (requires root)
calypso-healthcheck

# Run and save a copy of the output elsewhere (the script also writes its own log under /var/log)
calypso-healthcheck 2>&1 | tee /root/healthcheck-$(date +%Y%m%d).log

Exit Codes

0 - All checks passed (100% healthy)
1 - Healthy with warnings (some non-critical issues)
2 - Degraded (80%+ checks passed, some failures)
3 - Critical (less than 80% checks passed)
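The exit-code rules above can be sketched as a small bash function. This is an illustration, not the script's actual code: health_exit_code is a hypothetical name, and it assumes the percentage is passed checks divided by total checks with integer truncation, which agrees with the example summary further down (35 of 39 passed is reported as 89%).

```bash
#!/usr/bin/env bash
# Hypothetical helper: map check counters to the documented exit codes.
# Assumes percentage = passed * 100 / total, truncated (35/39 -> 89%).
health_exit_code() {  # usage: health_exit_code TOTAL PASSED WARNINGS FAILED
    local total=$1 passed=$2 warnings=$3 failed=$4
    # Assumption: a run with no checks at all is treated as critical
    if [ "$total" -le 0 ]; then echo 3; return; fi
    if [ "$failed" -eq 0 ] && [ "$warnings" -eq 0 ]; then
        echo 0   # all checks passed
    elif [ "$failed" -eq 0 ]; then
        echo 1   # healthy with warnings
    elif [ $((passed * 100 / total)) -ge 80 ]; then
        echo 2   # degraded: 80%+ passed, some failures
    else
        echo 3   # critical: less than 80% passed
    fi
}

# Usage (WARNING_CHECKS is a hypothetical counter name):
# exit "$(health_exit_code "$TOTAL_CHECKS" "$PASSED_CHECKS" "$WARNING_CHECKS" "$FAILED_CHECKS")"
```

Printing the code instead of returning it keeps the function easy to test; the caller hands the value to exit.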

Automated Checks

System Resources (4 checks)

Root filesystem usage (threshold: 80%)
/var filesystem usage (threshold: 80%)
Memory usage (threshold: 90%)
CPU load average
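A minimal sketch of how the disk and memory figures could be collected. It assumes "used memory" means MemTotal minus MemAvailable from /proc/meminfo, and that a check passes while usage is below its threshold, as in the -lt 90 memory test shown under Modify Thresholds. The helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the System Resources checks (Linux only).

# Percent used for the filesystem holding PATH, from POSIX `df -P` output.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prints "<used%> <availableMB>" from a meminfo-format file.
# Assumption: used = MemTotal - MemAvailable.
mem_stats() {
    awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
         END { printf "%d %d\n", (t - a) * 100 / t, a / 1024 }' "${1:-/proc/meminfo}"
}

# Passes while VALUE is below LIMIT, mirroring [ "$mem_percent" -lt 90 ].
check_threshold() {  # usage: check_threshold VALUE LIMIT LABEL
    if [ "$1" -lt "$2" ]; then
        echo "PASS $3 ($1% used)"
    else
        echo "FAIL $3 ($1% used)"
    fi
}

# Example:
# check_threshold "$(disk_used_percent /)" 80 "Root filesystem"
# read -r mem_percent mem_avail_mb < <(mem_stats)
```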

Database Services (2 checks)

PostgreSQL service status
Database presence (calypso, bacula)
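One way the database presence check might work, sketched under the assumption that psql is run as the postgres user. missing_databases is a hypothetical name: it reads database names one per line, as psql prints them in unaligned, tuples-only mode (-A -t), and echoes every required name that is absent.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each required database name not present on stdin.
# Input: one database name per line.
missing_databases() {
    local names db
    names=$(cat)
    for db in "$@"; do
        # -x: whole-line match, -F: literal string (no regex)
        grep -qxF -- "$db" <<< "$names" || echo "$db"
    done
}

# Example (as root):
# missing=$(sudo -u postgres psql -Atc "SELECT datname FROM pg_database" \
#           | missing_databases calypso bacula)
# [ -z "$missing" ] && echo "Databases present" || echo "Missing: $missing"
```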

Calypso Application (7 checks)

calypso-api service
calypso-frontend service
calypso-logger service
API port 8443
Frontend port 3000
API health endpoint
Frontend health endpoint
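The port and health-endpoint checks could look roughly like this. It is a sketch: port_listening and endpoint_healthy are hypothetical names, and the /health URLs are taken from the manual curl tests under Troubleshooting. Comparing the port field of ss output avoids the false matches a plain grep can produce (for example 3000 matching :30001).

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the port and health-endpoint checks.

# Succeeds if TCP PORT is a listening local address in `ss -tln` output on stdin.
port_listening() {
    awk -v port="$1" '
        NR > 1 { n = split($4, a, ":"); if (a[n] == port) found = 1 }
        END { exit !found }'
}

# Succeeds if URL answers with an HTTP status below 400 within 5 seconds.
# -k skips certificate verification, as in the manual curl tests.
endpoint_healthy() {
    curl -fsk --max-time 5 -o /dev/null "$1"
}

# Example:
# ss -tln | port_listening 8443 && echo "API port 8443 listening"
# endpoint_healthy https://localhost:8443/health && echo "API health endpoint OK"
```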

Backup Services - Bacula (8 checks)

bacula-director service
bacula-fd service
bacula-sd service
Director bconsole connectivity
Storage (Scalar-i500) accessibility
Director port 9101
FD port 9102
SD port 9103
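The service-status checks here and in the other sections can share one loop over unit names. check_services is a hypothetical helper; it relies only on systemctl is-active --quiet, which exits 0 when a unit is active.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report each systemd unit as PASS (active) or FAIL.
check_services() {
    local unit
    for unit in "$@"; do
        # Non-zero exit (inactive, failed, unknown unit, no systemctl) counts as FAIL
        if systemctl is-active --quiet "$unit" 2>/dev/null; then
            echo "PASS $unit"
        else
            echo "FAIL $unit"
        fi
    done
}

# Example:
# check_services bacula-director bacula-fd bacula-sd
```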

Virtual Tape Library - mhVTL (5 checks)

mhvtl.target status
vtllibrary@10 (Scalar i500)
vtllibrary@30 (Scalar i40)
VTL device count (2 changers, 8 tape drives)
Scalar i500 slots detection
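A sketch of the VTL device count check. It assumes the count comes from lsscsi's device-type column (mediumx for changers, tape for drives, the same types the Troubleshooting section greps for) and applies the 2-changer, 8-drive expectation from Modify Thresholds. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count changers and tape drives in `lsscsi` output on stdin.
# Prints "<changers> <tapes>" using the device-type column (field 2).
count_vtl_devices() {
    awk '$2 == "mediumx" { c++ } $2 == "tape" { t++ } END { printf "%d %d\n", c, t }'
}

# Mirrors the documented expectation: at least 2 changers and 8 tape drives.
vtl_devices_ok() {  # usage: vtl_devices_ok CHANGERS TAPES
    [ "$1" -ge 2 ] && [ "$2" -ge 8 ]
}

# Example:
# read -r changer_count tape_count < <(lsscsi | count_vtl_devices)
# vtl_devices_ok "$changer_count" "$tape_count" && echo "VTL devices OK"
```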

Storage Protocols (9 checks)

NFS server service
Samba (smbd) service
NetBIOS (nmbd) service
SCST service
iSCSI target service
NFS port 2049
SMB port 445
NetBIOS port 139
iSCSI port 3260

Monitoring & Management (2 checks)

SNMP daemon
SNMP port 161

Network Connectivity (2 checks)

Internet connectivity (ping 8.8.8.8)
Network manager status

Total: 39 automated checks (more if custom checks are added)

Output Format

Console Output

Color-coded status indicators:
- ✓ Green = Passed
- ⚠ Yellow = Warning
- ✗ Red = Failed

Example Output

==========================================
  CALYPSO APPLIANCE HEALTH CHECK
==========================================
Date: 2025-12-31 01:46:27
Hostname: calypso
Uptime: up 6 days, 2 hours, 50 minutes
Log file: /var/log/calypso-healthcheck-20251231-014627.log

========================================
SYSTEM RESOURCES
========================================
✓ Root filesystem (18% used)
✓ Var filesystem (18% used)
✓ Memory usage (49% used, 8206MB available)
✓ CPU load average (2.18, 8 cores)

...

========================================
HEALTH CHECK SUMMARY
========================================

Total Checks:    39
Passed:          35
Warnings:        0
Failed:          4

⚠ OVERALL STATUS: DEGRADED (89%)

Log Files

All checks are logged to: /var/log/calypso-healthcheck-YYYYMMDD-HHMMSS.log

Logs include:

Timestamp and system information
Detailed check results
Summary statistics
Overall health status

Scheduling

Manual Execution

# Run on demand
sudo calypso-healthcheck

Cron Job (Recommended)

Add to root's crontab (crontab -e as root), since the script requires root:

# Daily health check at 2 AM
0 2 * * * /usr/local/bin/calypso-healthcheck > /dev/null 2>&1

# Weekly health check on Monday at 6 AM with email notification
0 6 * * 1 /usr/local/bin/calypso-healthcheck 2>&1 | mail -s "Calypso Health Check" admin@example.com

Systemd Timer (Alternative)

Create /etc/systemd/system/calypso-healthcheck.timer:

[Unit]
Description=Daily Calypso Health Check
Requires=calypso-healthcheck.service

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Create /etc/systemd/system/calypso-healthcheck.service:

[Unit]
Description=Calypso Appliance Health Check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/calypso-healthcheck

Reload systemd so it picks up the new units, then enable the timer:

systemctl daemon-reload
systemctl enable --now calypso-healthcheck.timer

Troubleshooting

Common Failures

API/Frontend Health Endpoints Failing

# Check if services are running
systemctl status calypso-api calypso-frontend

# Check service logs
journalctl -u calypso-api -n 50
journalctl -u calypso-frontend -n 50

# Test manually
curl -k https://localhost:8443/health
curl -k https://localhost:3000/health

Bacula Director Not Responding

# Check service
systemctl status bacula-director

# Test bconsole
echo "status director" | bconsole

# Check logs
tail -50 /var/log/bacula/bacula.log

VTL Slots Not Detected

# Check VTL services
systemctl status mhvtl.target

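# Check devices (-g adds each device's /dev/sgN node)
lsscsi -g | grep -E "mediumx|tape"

# Test manually (replace /dev/sg7 with the changer's sg node shown by lsscsi -g)
mtx -f /dev/sg7 status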
echo "update slots storage=Scalar-i500" | bconsole
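The sg node of a changer (/dev/sg7 above) depends on device discovery order and can change between boots, so hard-coding it is fragile. Assuming lsscsi -g prints the sg node as its last column, a hypothetical helper can look it up; the vendor and model columns tell the Scalar i500 and i40 changers apart.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the generic SCSI (sg) node of each medium changer.
# Input: `lsscsi -g` output on stdin; the sg node is the last field.
changer_sg_devices() {
    awk '$2 == "mediumx" { print $NF }'
}

# Example: query every changer instead of hard-coding /dev/sg7
# for dev in $(lsscsi -g | changer_sg_devices); do mtx -f "$dev" status | head -n 5; done
```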

Storage Protocols Port Not Listening

# Check service status
systemctl status nfs-server smbd nmbd scst iscsi-scstd

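# Check listening ports (match ":PORT " so IP addresses or longer port numbers containing these digits are ignored)
ss -tuln | grep -E ':(2049|445|139|3260)[[:space:]]'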

# Restart services if needed
systemctl restart nfs-server
systemctl restart smbd nmbd

Customization

Modify Thresholds

Edit /usr/local/bin/calypso-healthcheck:

# Disk usage threshold (default: 80%)
check_disk "/" 80 "Root filesystem"

# Memory usage threshold (default: 90%)
if [ "$mem_percent" -lt 90 ]; then

# Change expected VTL devices
if [ "$changer_count" -ge 2 ] && [ "$tape_count" -ge 8 ]; then

Add Custom Checks

Add new check functions:

check_custom() {
    TOTAL_CHECKS=$((TOTAL_CHECKS + 1))
    
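    # Example condition (example.service is a placeholder); replace with your own test
    if systemctl is-active --quiet example.service; then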
        echo -e "${GREEN}${CHECK}${NC} Custom check passed" | tee -a "$LOG_FILE"
        PASSED_CHECKS=$((PASSED_CHECKS + 1))
    else
        echo -e "${RED}${CROSS}${NC} Custom check failed" | tee -a "$LOG_FILE"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}

# Call in main script
check_custom
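When several custom checks follow the same pattern, a generic wrapper avoids repeating the counter bookkeeping. run_check is hypothetical; it assumes the counter, color and LOG_FILE variables used in check_custom above, and gives them defaults only so the sketch runs on its own.

```bash
#!/usr/bin/env bash
# Hypothetical generic wrapper: run a command as a check and update the counters.
# Assumes the script's TOTAL_CHECKS/PASSED_CHECKS/FAILED_CHECKS counters and
# GREEN/RED/NC/CHECK/CROSS/LOG_FILE variables; defaults keep it standalone.
: "${TOTAL_CHECKS:=0}" "${PASSED_CHECKS:=0}" "${FAILED_CHECKS:=0}"
: "${LOG_FILE:=/dev/null}"

run_check() {  # usage: run_check LABEL COMMAND [ARGS...]
    local label=$1; shift
    TOTAL_CHECKS=$((TOTAL_CHECKS + 1))
    # The command's exit status decides pass/fail; its output is discarded
    if "$@" > /dev/null 2>&1; then
        echo -e "${GREEN}${CHECK:-✓}${NC} $label" | tee -a "$LOG_FILE"
        PASSED_CHECKS=$((PASSED_CHECKS + 1))
    else
        echo -e "${RED}${CROSS:-✗}${NC} $label" | tee -a "$LOG_FILE"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}

# Example (/srv/backup is a placeholder path):
# run_check "Backup mount /srv/backup" mountpoint -q /srv/backup
```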

Integration

Monitoring Systems

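Wrap the exit code in a Nagios/Icinga-style check. Plugins report 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, and the check must run as root:

# Nagios/Icinga plugin wrapper
# Capture the exit code once: $? is overwritten by every test that follows.
# The script's own output is discarded; Nagios reads only the status line below.
/usr/local/bin/calypso-healthcheck > /dev/null 2>&1
rc=$?

case "$rc" in
    0) echo "OK - All checks passed"; exit 0 ;;
    1) echo "WARNING - Healthy with warnings"; exit 1 ;;
    2) echo "CRITICAL - System degraded"; exit 2 ;;
    3) echo "CRITICAL - Critical failures"; exit 2 ;;
    *) echo "UNKNOWN - Unexpected exit code $rc"; exit 3 ;;
esac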

API Integration

Machine-readable output (requires first adding a --json option to the script):

# Once a --json option exists
calypso-healthcheck --json > /tmp/health.json
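If a --json option is added, the summary it prints might look like the minimal sketch below. json_summary, the field names and the status strings are assumptions (the statuses follow the exit codes listed earlier); values are printed without JSON escaping, which is safe only for plain hostnames and integers.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a JSON summary a --json option could print.
# Field names and status strings are assumptions, not the script's format.
json_summary() {  # usage: json_summary TOTAL PASSED WARNINGS FAILED EXIT_CODE
    local status
    case "$5" in
        0) status=healthy ;;
        1) status=warning ;;
        2) status=degraded ;;
        *) status=critical ;;
    esac
    # No escaping: assumes HOSTNAME contains only plain hostname characters
    printf '{"hostname":"%s","total":%d,"passed":%d,"warnings":%d,"failed":%d,"status":"%s"}\n' \
        "${HOSTNAME:-unknown}" "$1" "$2" "$3" "$4" "$status"
}

# Example: json_summary 39 35 0 4 2 > /tmp/health.json
```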

Maintenance

Log Rotation

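Each run writes its own timestamped file to /var/log/calypso-healthcheck-*.log and never touches it again, so logrotate will rotate (compress) each file at most once and will not prune old runs; use the age-based cleanup below for retention.

To compress logs with logrotate anyway, create /etc/logrotate.d/calypso-healthcheck:

/var/log/calypso-healthcheck-*.log {
    weekly
    rotate 12
    compress
    missingok
    notifempty
    nocreate
}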

Cleanup Old Logs

# Remove logs older than 30 days, including logrotate-compressed copies
find /var/log -maxdepth 1 -name "calypso-healthcheck-*.log*" -mtime +30 -delete

Best Practices

Run after reboot - Verify all services started correctly
Schedule regular checks - Daily or weekly automated runs
Monitor exit codes - Alert on degraded/critical status
Review logs periodically - Identify patterns or recurring issues
Update checks - Add new components as system evolves
Baseline health - Establish normal operating parameters
Document exceptions - Note known warnings that are acceptable
