# Logging & Diagnostics

## Overview

AtlasOS provides comprehensive logging and diagnostic capabilities to help monitor system health, troubleshoot issues, and understand system behavior.

## Structured Logging

### Logger Package

The `internal/logger` package provides structured logging with:

- **Log Levels**: DEBUG, INFO, WARN, ERROR
- **JSON Mode**: Optional JSON-formatted output
- **Structured Fields**: Key-value pairs for context
- **Thread-Safe**: Safe for concurrent use

### Configuration

Configure logging via environment variables:

```bash
# Log level (DEBUG, INFO, WARN, ERROR)
export ATLAS_LOG_LEVEL=INFO

# Log format (json or text)
export ATLAS_LOG_FORMAT=json
```

### Usage

```go
import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"

// Simple logging
logger.Info("User logged in")
logger.Error("Failed to create pool", err)

// With fields
logger.Info("Pool created", map[string]interface{}{
    "pool": "tank",
    "size": "10TB",
})
```

### Log Levels

- **DEBUG**: Detailed information for debugging
- **INFO**: General informational messages
- **WARN**: Warning messages for potential issues
- **ERROR**: Error messages for failures

## Request Logging

### Access Logs

All HTTP requests are logged with:

- **Timestamp**: Request time
- **Method**: HTTP method (GET, POST, etc.)
- **Path**: Request path
- **Status**: HTTP status code
- **Duration**: Request processing time
- **Request ID**: Unique request identifier
- **Remote Address**: Client IP address

**Example Log Entry:**

```
2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms
```

### Request ID

Every request gets a unique request ID:

- **Header**: `X-Request-Id`
- **Usage**: Track requests across services
- **Format**: 32-character hex string

## Diagnostic Endpoints

### System Information

**GET** `/api/v1/system/info`

Returns comprehensive system information:

```json
{
  "version": "v0.1.0-dev",
  "uptime": "3600 seconds",
  "go_version": "go1.21.0",
  "num_goroutines": 15,
  "memory": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "services": {
    "smb": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "nfs": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "iscsi": {
      "status": "stopped",
      "last_check": "2024-12-20T10:30:56Z"
    }
  },
  "database": {
    "connected": true,
    "path": "/var/lib/atlas/atlas.db"
  }
}
```

### Health Check

**GET** `/health`

Detailed health check with component status:

```json
{
  "status": "healthy",
  "timestamp": "2024-12-20T10:30:56Z",
  "checks": {
    "zfs": "healthy",
    "database": "healthy",
    "smb": "healthy",
    "nfs": "healthy",
    "iscsi": "stopped"
  }
}
```

**Status Values:**

- `healthy`: Component is working correctly
- `degraded`: Some components have issues but the system is operational
- `unhealthy`: Critical components are failing

**HTTP Status Codes:**

- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy

### System Logs

**GET** `/api/v1/system/logs?limit=100`

Returns recent system logs, sourced from the audit logs:

```json
{
  "logs": [
    {
      "timestamp": "2024-12-20T10:30:56Z",
      "level": "INFO",
      "actor": "user-1",
      "action": "pool.create",
      "resource": "pool:tank",
      "result": "success",
      "ip": "192.168.1.100"
    }
  ],
  "count": 1
}
```

**Query Parameters:**

- `limit`: Maximum number of logs to return (default: 100, max: 1000)

### Garbage Collection

**POST** `/api/v1/system/gc`

Triggers garbage collection and returns memory statistics from before and after the run:

```json
{
  "before": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "after": {
    "alloc": 512000,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 6
  },
  "freed": 536576
}
```

## Audit Logging

Audit logs track all mutating operations:

- **Actor**: User ID or "system"
- **Action**: Operation type (e.g., "pool.create")
- **Resource**: Resource identifier
- **Result**: "success" or "failure"
- **IP**: Client IP address
- **User Agent**: Client user agent
- **Timestamp**: Operation time

See [Audit Logging Documentation](./AUDIT_LOGGING.md) for details.

## Log Rotation

### Current Implementation

- **In-Memory**: Audit logs are stored in memory
- **Rotation**: Automatic rotation when the maximum number of logs is reached
- **Limit**: Configurable (default: 10,000 logs)

### Future Enhancements

- **File Logging**: Write logs to files
- **Automatic Rotation**: Rotate log files by size/age
- **Compression**: Compress old log files
- **Retention**: Configurable retention policies

## Best Practices

### 1. Use Appropriate Log Levels

```go
// Debug - detailed information
logger.Debug("Processing request", map[string]interface{}{
    "request_id": reqID,
    "user":       userID,
})

// Info - important events
logger.Info("User logged in", map[string]interface{}{
    "user": userID,
})

// Warn - potential issues
logger.Warn("High memory usage", map[string]interface{}{
    "usage": "85%",
})

// Error - failures
logger.Error("Failed to create pool", err, map[string]interface{}{
    "pool": poolName,
})
```

### 2. Include Context

Always include relevant context in logs:

```go
// Good
logger.Info("Pool created", map[string]interface{}{
    "pool": poolName,
    "size": poolSize,
    "user": userID,
})

// Avoid
logger.Info("Pool created")
```

### 3. Use Request IDs

Include request IDs in logs for tracing:

```go
// The comma-ok form avoids a panic if no request ID is stored in the context;
// reqID is then the empty string.
reqID, _ := r.Context().Value(requestIDKey).(string)
logger.Info("Processing request", map[string]interface{}{
    "request_id": reqID,
})
```

### 4. Monitor Health Endpoints

Regularly check health endpoints:

```bash
# Simple health check
curl http://localhost:8080/healthz

# Detailed health check
curl http://localhost:8080/health

# System information
curl http://localhost:8080/api/v1/system/info
```

## Monitoring

### Key Metrics

Monitor these metrics for system health:

- **Request Duration**: Track in access logs
- **Error Rate**: Count of error responses
- **Memory Usage**: Check via `/api/v1/system/info`
- **Goroutine Count**: Monitor for leaks
- **Service Status**: Check service health

### Alerting

Set up alerts for:

- **Unhealthy Status**: System health check fails
- **High Error Rate**: Too many error responses
- **Memory Leaks**: Continuously increasing memory
- **Service Failures**: Services not running

## Troubleshooting

### Check System Health

```bash
curl http://localhost:8080/health
```

### View System Information

```bash
curl http://localhost:8080/api/v1/system/info
```

### Check Recent Logs

```bash
# Quote the URL so the shell does not interpret the "?"
curl "http://localhost:8080/api/v1/system/logs?limit=50"
```

### Trigger GC

```bash
curl -X POST http://localhost:8080/api/v1/system/gc
```

### View Request Logs

Check application logs for request details:

```bash
# If logging to stdout
./atlas-api | grep "GET /api/v1/pools"

# If logging to file
tail -f /var/log/atlas-api.log | grep "status=500"
```

## Future Enhancements

1. **File Logging**: Write logs to files with rotation
2. **Log Aggregation**: Support for centralized logging (ELK, Loki)
3. **Structured Logging**: Full JSON logging support
4. **Log Levels per Component**: Different levels for different components
5. **Performance Logging**: Detailed performance metrics
6. **Distributed Tracing**: Request tracing across services
7. **Log Filtering**: Filter logs by level, component, etc.
8. **Real-time Log Streaming**: Stream logs via WebSocket