atlas/docs/LOGGING_DIAGNOSTICS.md
othman.suseno df475bc85e
logging and diagnostic features added
2025-12-15 00:45:14 +07:00

Logging & Diagnostics

Overview

AtlasOS provides comprehensive logging and diagnostic capabilities to help monitor system health, troubleshoot issues, and understand system behavior.

Structured Logging

Logger Package

The internal/logger package provides structured logging with:

  • Log Levels: DEBUG, INFO, WARN, ERROR
  • JSON Mode: Optional JSON-formatted output
  • Structured Fields: Key-value pairs for context
  • Thread-Safe: Safe for concurrent use

Configuration

Configure logging via environment variables:

# Log level (DEBUG, INFO, WARN, ERROR)
export ATLAS_LOG_LEVEL=INFO

# Log format (json or text)
export ATLAS_LOG_FORMAT=json
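
As a rough illustration of how these two variables might be interpreted at startup, here is a minimal sketch. The `Level` type, `parseLevel`, and `useJSON` are hypothetical names, not the actual API of internal/logger, and the fallback to INFO for unset or unrecognized values is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Level is a hypothetical severity type; higher values are more severe.
type Level int

const (
	Debug Level = iota
	Info
	Warn
	Error
)

// parseLevel maps a level name to a Level. Falling back to Info for
// empty or unrecognized input is an assumption of this sketch.
func parseLevel(s string) Level {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "DEBUG":
		return Debug
	case "WARN":
		return Warn
	case "ERROR":
		return Error
	default:
		return Info
	}
}

// useJSON reports whether the format value selects JSON output.
func useJSON(s string) bool {
	return strings.EqualFold(strings.TrimSpace(s), "json")
}

func main() {
	level := parseLevel(os.Getenv("ATLAS_LOG_LEVEL"))
	jsonMode := useJSON(os.Getenv("ATLAS_LOG_FORMAT"))
	fmt.Printf("level=%d json=%t\n", level, jsonMode)
}
```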

Usage

import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"

// Simple logging
logger.Info("User logged in")
logger.Error("Failed to create pool", err)

// With fields
logger.Info("Pool created", map[string]interface{}{
    "pool": "tank",
    "size": "10TB",
})

Log Levels

  • DEBUG: Detailed information for debugging
  • INFO: General informational messages
  • WARN: Warning messages for potential issues
  • ERROR: Error messages for failures
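
The sketch below shows one way level filtering and JSON output could fit together: a message is emitted only if its level is at or above the configured threshold. The names `enabled` and `formatJSON` and the JSON keys "level" and "message" are assumptions for illustration, not the real output schema of the logger package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Level is a hypothetical severity type ordered from least to most severe.
type Level int

const (
	Debug Level = iota
	Info
	Warn
	Error
)

var names = [...]string{"DEBUG", "INFO", "WARN", "ERROR"}

// enabled reports whether a message at level l passes the threshold.
func enabled(threshold, l Level) bool { return l >= threshold }

// formatJSON renders one entry as a JSON object. Caller fields are copied
// first so the reserved keys "level" and "message" always win.
func formatJSON(l Level, msg string, fields map[string]interface{}) (string, error) {
	entry := make(map[string]interface{}, len(fields)+2)
	for k, v := range fields {
		entry[k] = v
	}
	entry["level"] = names[l]
	entry["message"] = msg
	b, err := json.Marshal(entry)
	return string(b), err
}

func main() {
	threshold := Warn
	for _, l := range []Level{Debug, Info, Warn, Error} {
		if !enabled(threshold, l) {
			continue // DEBUG and INFO are dropped at a WARN threshold
		}
		line, err := formatJSON(l, "example", map[string]interface{}{"pool": "tank"})
		if err != nil {
			fmt.Println("marshal error:", err)
			continue
		}
		fmt.Println(line)
	}
}
```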

Request Logging

Access Logs

All HTTP requests are logged with:

  • Timestamp: Request time
  • Method: HTTP method (GET, POST, etc.)
  • Path: Request path
  • Status: HTTP status code
  • Duration: Request processing time
  • Request ID: Unique request identifier
  • Remote Address: Client IP address

Example Log Entry:

2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms

Request ID

Every request gets a unique request ID:

  • Header: X-Request-Id
  • Usage: Track requests across services
  • Format: 32-character hex string

Diagnostic Endpoints

System Information

GET /api/v1/system/info

Returns comprehensive system information:

{
  "version": "v0.1.0-dev",
  "uptime": "3600 seconds",
  "go_version": "go1.21.0",
  "num_goroutines": 15,
  "memory": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "services": {
    "smb": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "nfs": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "iscsi": {
      "status": "stopped",
      "last_check": "2024-12-20T10:30:56Z"
    }
  },
  "database": {
    "connected": true,
    "path": "/var/lib/atlas/atlas.db"
  }
}

Health Check

GET /health

Detailed health check with component status:

{
  "status": "healthy",
  "timestamp": "2024-12-20T10:30:56Z",
  "checks": {
    "zfs": "healthy",
    "database": "healthy",
    "smb": "healthy",
    "nfs": "healthy",
    "iscsi": "stopped"
  }
}

Status Values:

  • healthy: Component is working correctly
  • degraded: Some components have issues but system is operational
  • unhealthy: Critical components are failing

HTTP Status Codes:

  • 200 OK: System is healthy or degraded
  • 503 Service Unavailable: System is unhealthy
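
One possible way to derive the overall status and HTTP code from the per-component checks is sketched below. The rules are assumptions chosen to be consistent with the example response: a "stopped" service does not affect the overall status, a failing critical component makes the system unhealthy, and any other failure makes it degraded. Which components count as critical is also an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// overallStatus folds component checks into a single status.
// critical names the components whose failure makes the system unhealthy.
func overallStatus(checks map[string]string, critical map[string]bool) string {
	status := "healthy"
	for name, s := range checks {
		if s == "healthy" || s == "stopped" {
			continue // assumed not to affect overall health
		}
		if critical[name] {
			return "unhealthy"
		}
		status = "degraded"
	}
	return status
}

// httpStatus maps the overall status to the documented HTTP codes:
// 200 for healthy or degraded, 503 for unhealthy.
func httpStatus(status string) int {
	if status == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	critical := map[string]bool{"zfs": true, "database": true} // assumption
	checks := map[string]string{
		"zfs":      "healthy",
		"database": "healthy",
		"smb":      "healthy",
		"nfs":      "healthy",
		"iscsi":    "stopped",
	}
	s := overallStatus(checks, critical)
	fmt.Println(s, httpStatus(s))
}
```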

System Logs

GET /api/v1/system/logs?limit=100

Returns recent entries from the audit log:

{
  "logs": [
    {
      "timestamp": "2024-12-20T10:30:56Z",
      "level": "INFO",
      "actor": "user-1",
      "action": "pool.create",
      "resource": "pool:tank",
      "result": "success",
      "ip": "192.168.1.100"
    }
  ],
  "count": 1
}

Query Parameters:

  • limit: Maximum number of logs to return (default: 100, max: 1000)
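
The default and maximum for limit can be enforced with a small parsing helper, sketched below. `parseLimit` is a hypothetical name, and treating missing, non-numeric, or non-positive values as the default is an assumption; the real handler might reject them with a 400 response instead.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

const (
	defaultLimit = 100  // documented default
	maxLimit     = 1000 // documented maximum
)

// parseLimit converts the raw query value into a bounded limit.
func parseLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultLimit // covers empty, non-numeric, and non-positive input
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}

func main() {
	for _, q := range []string{"", "limit=50", "limit=5000", "limit=abc"} {
		values, err := url.ParseQuery(q)
		if err != nil {
			fmt.Println("bad query:", err)
			continue
		}
		fmt.Printf("%q -> %d\n", q, parseLimit(values.Get("limit")))
	}
}
```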

Garbage Collection

POST /api/v1/system/gc

Triggers garbage collection and returns memory statistics:

{
  "before": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "after": {
    "alloc": 512000,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 6
  },
  "freed": 536576
}
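
In the example response, freed equals the drop in alloc (1048576 - 512000 = 536576). A handler for this endpoint could be built on runtime.ReadMemStats and runtime.GC, as in the sketch below. The type and function names are hypothetical, and reporting 0 for freed when alloc did not shrink is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

// memStats mirrors the "before" and "after" objects in the example.
type memStats struct {
	Alloc      uint64 `json:"alloc"`
	TotalAlloc uint64 `json:"total_alloc"`
	Sys        uint64 `json:"sys"`
	NumGC      uint32 `json:"num_gc"`
}

// gcResult mirrors the full response body.
type gcResult struct {
	Before memStats `json:"before"`
	After  memStats `json:"after"`
	Freed  uint64   `json:"freed"`
}

// snapshot reads the current memory statistics.
func snapshot() memStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memStats{Alloc: m.Alloc, TotalAlloc: m.TotalAlloc, Sys: m.Sys, NumGC: m.NumGC}
}

// runGC forces a collection and reports statistics from both sides of it.
func runGC() gcResult {
	before := snapshot()
	runtime.GC() // blocks until the collection completes
	after := snapshot()

	var freed uint64
	if before.Alloc > after.Alloc {
		freed = before.Alloc - after.Alloc // avoid unsigned underflow
	}
	return gcResult{Before: before, After: after, Freed: freed}
}

func main() {
	out, err := json.MarshalIndent(runGC(), "", "  ")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(out))
}
```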

Audit Logging

Audit logs track all mutating operations:

  • Actor: User ID or "system"
  • Action: Operation type (e.g., "pool.create")
  • Resource: Resource identifier
  • Result: "success" or "failure"
  • IP: Client IP address
  • User Agent: Client user agent
  • Timestamp: Operation time

See Audit Logging Documentation for details.

Log Rotation

Current Implementation

  • In-Memory: Audit logs are stored in memory
  • Rotation: Automatic once the maximum number of logs is reached
  • Limit: Configurable (default: 10,000 logs)

Future Enhancements

  • File Logging: Write logs to files
  • Automatic Rotation: Rotate log files by size/age
  • Compression: Compress old log files
  • Retention: Configurable retention policies

Best Practices

1. Use Appropriate Log Levels

// Debug - detailed information
logger.Debug("Processing request", map[string]interface{}{
    "request_id": reqID,
    "user": userID,
})

// Info - important events
logger.Info("User logged in", map[string]interface{}{
    "user": userID,
})

// Warn - potential issues
logger.Warn("High memory usage", map[string]interface{}{
    "usage": "85%",
})

// Error - failures
logger.Error("Failed to create pool", err, map[string]interface{}{
    "pool": poolName,
})

2. Include Context

Always include relevant context in logs:

// Good
logger.Info("Pool created", map[string]interface{}{
    "pool": poolName,
    "size": poolSize,
    "user": userID,
})

// Avoid
logger.Info("Pool created")

3. Use Request IDs

Include request IDs in logs for tracing:

// Two-value assertion: a missing or non-string request ID yields "" instead of panicking
reqID, _ := r.Context().Value(requestIDKey).(string)
logger.Info("Processing request", map[string]interface{}{
    "request_id": reqID,
})

4. Monitor Health Endpoints

Regularly check health endpoints:

# Simple health check
curl http://localhost:8080/healthz

# Detailed health check
curl http://localhost:8080/health

# System information
curl http://localhost:8080/api/v1/system/info

Monitoring

Key Metrics

Monitor these metrics for system health:

  • Request Duration: Track in access logs
  • Error Rate: Count of error responses
  • Memory Usage: Check via /api/v1/system/info
  • Goroutine Count: Monitor for leaks
  • Service Status: Check service health

Alerting

Set up alerts for:

  • Unhealthy Status: System health check fails
  • High Error Rate: Too many error responses
  • Memory Leaks: Continuously increasing memory
  • Service Failures: Services not running

Troubleshooting

Check System Health

curl http://localhost:8080/health

View System Information

curl http://localhost:8080/api/v1/system/info

Check Recent Logs

curl "http://localhost:8080/api/v1/system/logs?limit=50"

Trigger GC

curl -X POST http://localhost:8080/api/v1/system/gc

View Request Logs

Check application logs for request details:

# If logging to stdout
./atlas-api | grep "GET /api/v1/pools"

# If logging to file
tail -f /var/log/atlas-api.log | grep "status=500"

Future Enhancements

  1. File Logging: Write logs to files with rotation
  2. Log Aggregation: Support for centralized logging (ELK, Loki)
  3. Structured Logging: Full JSON logging support
  4. Log Levels per Component: Different levels for different components
  5. Performance Logging: Detailed performance metrics
  6. Distributed Tracing: Request tracing across services
  7. Log Filtering: Filter logs by level, component, etc.
  8. Real-time Log Streaming: Stream logs via WebSocket