othman.suseno/atlas

othman.suseno df475bc85e

logging and diagnostic features added

2025-12-15 00:45:14 +07:00

Logging & Diagnostics

Overview

AtlasOS provides structured application logging, per-request access logs, audit logs, and diagnostic HTTP endpoints. Together they help you monitor system health, troubleshoot issues, and understand system behavior.

Structured Logging

Logger Package

The internal/logger package provides structured logging with:

Log Levels: DEBUG, INFO, WARN, ERROR
JSON Mode: Optional JSON-formatted output
Structured Fields: Key-value pairs for context
Thread-Safe: Safe for concurrent use

Configuration

Configure logging via environment variables:

# Log level (DEBUG, INFO, WARN, ERROR)
export ATLAS_LOG_LEVEL=INFO

# Log format (json or text)
export ATLAS_LOG_FORMAT=json
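As a rough sketch of how these two variables might be read at startup: the `Config` type and `parseConfig` function below are hypothetical, and the defaults (INFO, text) and case-insensitive matching are assumptions, since the document does not state them.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config holds the parsed logging settings (hypothetical type).
type Config struct {
	Level string // one of DEBUG, INFO, WARN, ERROR
	JSON  bool   // true for json, false for text
}

// parseConfig validates the raw variable values. Empty values fall back to
// INFO and text, which are assumed defaults.
func parseConfig(level, format string) (Config, error) {
	cfg := Config{Level: "INFO"}
	if level != "" {
		switch l := strings.ToUpper(strings.TrimSpace(level)); l {
		case "DEBUG", "INFO", "WARN", "ERROR":
			cfg.Level = l
		default:
			return cfg, fmt.Errorf("unknown log level %q", level)
		}
	}
	switch strings.ToLower(strings.TrimSpace(format)) {
	case "", "text":
		// keep the text default
	case "json":
		cfg.JSON = true
	default:
		return cfg, fmt.Errorf("unknown log format %q", format)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(os.Getenv("ATLAS_LOG_LEVEL"), os.Getenv("ATLAS_LOG_FORMAT"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid logging configuration:", err)
		os.Exit(1)
	}
	fmt.Printf("level=%s json=%t\n", cfg.Level, cfg.JSON)
}
```

Rejecting unknown values at startup, rather than silently falling back, makes a typo such as `ATLAS_LOG_LEVEL=INFOO` visible immediately.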

Usage

import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"

// Simple logging
logger.Info("User logged in")
logger.Error("Failed to create pool", err)

// With fields
logger.Info("Pool created", map[string]interface{}{
    "pool": "tank",
    "size": "10TB",
})

Log Levels

DEBUG: Detailed information for debugging
INFO: General informational messages
WARN: Warning messages for potential issues
ERROR: Error messages for failures

Request Logging

Access Logs

All HTTP requests are logged with:

Timestamp: Request time
Method: HTTP method (GET, POST, etc.)
Path: Request path
Status: HTTP status code
Duration: Request processing time
Request ID: Unique request identifier
Remote Address: Client IP address

Example Log Entry (request ID shortened for readability):

2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms

Request ID

Every request gets a unique request ID:

Header: X-Request-Id
Usage: Track requests across services
Format: 32-character hex string

Diagnostic Endpoints

System Information

GET /api/v1/system/info

Returns the build version, uptime, Go runtime statistics, per-service status, and database connection state:

{
  "version": "v0.1.0-dev",
  "uptime": "3600 seconds",
  "go_version": "go1.21.0",
  "num_goroutines": 15,
  "memory": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "services": {
    "smb": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "nfs": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "iscsi": {
      "status": "stopped",
      "last_check": "2024-12-20T10:30:56Z"
    }
  },
  "database": {
    "connected": true,
    "path": "/var/lib/atlas/atlas.db"
  }
}

Health Check

GET /health

Returns the overall system status together with the status of each component:

{
  "status": "healthy",
  "timestamp": "2024-12-20T10:30:56Z",
  "checks": {
    "zfs": "healthy",
    "database": "healthy",
    "smb": "healthy",
    "nfs": "healthy",
    "iscsi": "stopped"
  }
}

Status Values:

healthy: Component is working correctly
degraded: Some components have issues but system is operational
unhealthy: Critical components are failing

HTTP Status Codes:

200 OK: System is healthy or degraded
503 Service Unavailable: System is unhealthy
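The document does not spell out how component checks combine into the overall status. The sketch below shows one possible rule under stated assumptions: zfs and database are treated as critical, a "stopped" component is treated as intentionally disabled rather than failing, and any other non-healthy value counts as a failure. `critical`, `overallStatus`, and `httpCode` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
)

// critical lists components whose failure makes the whole system unhealthy.
// Which components are critical is an assumption, not taken from Atlas.
var critical = map[string]bool{"zfs": true, "database": true}

// overallStatus folds per-component states into healthy, degraded, or unhealthy.
func overallStatus(checks map[string]string) string {
	status := "healthy"
	for name, state := range checks {
		if state == "healthy" || state == "stopped" {
			continue // "stopped" is assumed to mean deliberately not running
		}
		if critical[name] {
			return "unhealthy"
		}
		status = "degraded"
	}
	return status
}

// httpCode maps the overall status to the documented HTTP status codes.
func httpCode(status string) int {
	if status == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK // healthy and degraded both answer 200
}

func main() {
	checks := map[string]string{
		"zfs": "healthy", "database": "healthy",
		"smb": "healthy", "nfs": "healthy", "iscsi": "stopped",
	}
	status := overallStatus(checks)
	fmt.Println(status, httpCode(status)) // prints "healthy 200"
}
```

Under this rule the example response above stays "healthy" even though iscsi is stopped, which matches the sample payload.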

System Logs

GET /api/v1/system/logs?limit=100

Returns recent system logs (from audit logs):

{
  "logs": [
    {
      "timestamp": "2024-12-20T10:30:56Z",
      "level": "INFO",
      "actor": "user-1",
      "action": "pool.create",
      "resource": "pool:tank",
      "result": "success",
      "ip": "192.168.1.100"
    }
  ],
  "count": 1
}

Query Parameters:

limit: Maximum number of logs to return (default: 100, max: 1000)

Garbage Collection

POST /api/v1/system/gc

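Triggers a garbage collection and returns memory statistics from before and after the run. In the example below, freed is the drop in alloc (1048576 - 512000 = 536576 bytes):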

{
  "before": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "after": {
    "alloc": 512000,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 6
  },
  "freed": 536576
}
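A response of this shape can be assembled from two `runtime.ReadMemStats` snapshots taken around `runtime.GC()`. The sketch below assumes that is how the endpoint works; `memSnapshot`, `gcReport`, `freedBytes`, and `collectGarbage` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"runtime"
)

type memSnapshot struct {
	Alloc      uint64 `json:"alloc"`
	TotalAlloc uint64 `json:"total_alloc"`
	Sys        uint64 `json:"sys"`
	NumGC      uint32 `json:"num_gc"`
}

type gcReport struct {
	Before memSnapshot `json:"before"`
	After  memSnapshot `json:"after"`
	Freed  uint64      `json:"freed"`
}

func snapshot() memSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memSnapshot{Alloc: m.Alloc, TotalAlloc: m.TotalAlloc, Sys: m.Sys, NumGC: m.NumGC}
}

// freedBytes guards against unsigned underflow: other goroutines may allocate
// during the collection, so alloc can be higher afterwards.
func freedBytes(before, after uint64) uint64 {
	if after >= before {
		return 0
	}
	return before - after
}

// collectGarbage forces one collection and reports the memory delta.
func collectGarbage() gcReport {
	before := snapshot()
	runtime.GC() // blocks until the collection completes
	after := snapshot()
	return gcReport{Before: before, After: after, Freed: freedBytes(before.Alloc, after.Alloc)}
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(collectGarbage()); err != nil {
		fmt.Fprintln(os.Stderr, "encode:", err)
	}
}
```

Because `total_alloc` is cumulative, it never decreases across a collection; only `alloc` reflects memory that was actually released.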

Audit Logging

Audit logs record every mutating operation. Each entry contains:

Actor: User ID or "system"
Action: Operation type (e.g., "pool.create")
Resource: Resource identifier
Result: "success" or "failure"
IP: Client IP address
User Agent: Client user agent
Timestamp: Operation time

See Audit Logging Documentation for details.

Log Rotation

Current Implementation

In-Memory: Audit logs stored in memory
Rotation: Automatic once the maximum number of log entries is reached
Limit: Configurable (default: 10,000 logs)

Future Enhancements

File Logging: Write logs to files
Automatic Rotation: Rotate log files by size/age
Compression: Compress old log files
Retention: Configurable retention policies

Best Practices

1. Use Appropriate Log Levels

// Debug - detailed information
logger.Debug("Processing request", map[string]interface{}{
    "request_id": reqID,
    "user": userID,
})

// Info - important events
logger.Info("User logged in", map[string]interface{}{
    "user": userID,
})

// Warn - potential issues
logger.Warn("High memory usage", map[string]interface{}{
    "usage": "85%",
})

// Error - failures
logger.Error("Failed to create pool", err, map[string]interface{}{
    "pool": poolName,
})

2. Include Context

Always include relevant context in logs:

// Good
logger.Info("Pool created", map[string]interface{}{
    "pool": poolName,
    "size": poolSize,
    "user": userID,
})

// Avoid
logger.Info("Pool created")

3. Use Request IDs

Include request IDs in logs for tracing:

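// Use the two-value type assertion: the one-value form panics
// when the request ID is missing from the context.
reqID, ok := r.Context().Value(requestIDKey).(string)
if !ok {
    reqID = "unknown"
}
logger.Info("Processing request", map[string]interface{}{
    "request_id": reqID,
})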

4. Monitor Health Endpoints

Regularly check health endpoints:

# Simple health check
curl http://localhost:8080/healthz

# Detailed health check
curl http://localhost:8080/health

# System information
curl http://localhost:8080/api/v1/system/info

Monitoring

Key Metrics

Monitor these metrics for system health:

Request Duration: Track in access logs
Error Rate: Count of error responses
Memory Usage: Check via /api/v1/system/info
Goroutine Count: Monitor for leaks
Service Status: Check service health

Alerting

Set up alerts for:

Unhealthy Status: System health check fails
High Error Rate: Too many error responses
Memory Leaks: Continuously increasing memory
Service Failures: Services not running

Troubleshooting

Check System Health

curl http://localhost:8080/health

View System Information

curl http://localhost:8080/api/v1/system/info

Check Recent Logs

# Quote the URL so the shell does not treat "?" as a glob character
curl "http://localhost:8080/api/v1/system/logs?limit=50"

Trigger GC

curl -X POST http://localhost:8080/api/v1/system/gc

View Request Logs

Check application logs for request details:

# If logging to stdout
./atlas-api | grep "GET /api/v1/pools"

# If logging to file
tail -f /var/log/atlas-api.log | grep "status=500"

Future Enhancements

File Logging: Write logs to files with rotation
Log Aggregation: Support for centralized logging (ELK, Loki)
Structured Logging: Full JSON logging support
Log Levels per Component: Different levels for different components
Performance Logging: Detailed performance metrics
Distributed Tracing: Request tracing across services
Log Filtering: Filter logs by level, component, etc.
Real-time Log Streaming: Stream logs via WebSocket
