logging and diagnostic features added

2025-12-15 00:45:14 +07:00
parent 3e64de18ed
commit df475bc85e
26 changed files with 5878 additions and 91 deletions
--- a/docs/LOGGING_DIAGNOSTICS.md
+++ b/docs/LOGGING_DIAGNOSTICS.md
@@ -0,0 +1,366 @@
+# Logging & Diagnostics
+
+## Overview
+
+AtlasOS provides structured logging, HTTP request logging, diagnostic endpoints, and audit logging to help monitor system health, troubleshoot issues, and understand system behavior.
+
+## Structured Logging
+
+### Logger Package
+
+The `internal/logger` package provides structured logging with:
+
+- **Log Levels**: DEBUG, INFO, WARN, ERROR
+- **JSON Mode**: Optional JSON-formatted output
+- **Structured Fields**: Key-value pairs for context
+- **Thread-Safe**: Safe for concurrent use
+
+### Configuration
+
+Configure logging via environment variables:
+
+```bash
+# Log level (DEBUG, INFO, WARN, ERROR)
+export ATLAS_LOG_LEVEL=INFO
+
+# Log format (json or text)
+export ATLAS_LOG_FORMAT=json
+```
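+
+As a rough illustration of how these two variables might be interpreted, the sketch below parses them into a level and a format flag. The `Level` type, `parseLevel`, and `useJSON` are hypothetical names, and the fallback to INFO for unknown values is an assumption rather than confirmed behavior of `internal/logger`.
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+	"strings"
+)
+
+// Level is a hypothetical severity type; the real internal/logger type may differ.
+type Level int
+
+const (
+	LevelDebug Level = iota
+	LevelInfo
+	LevelWarn
+	LevelError
+)
+
+// parseLevel maps a level name to a Level. Empty or unrecognized
+// input falls back to INFO (an assumption, not documented behavior).
+func parseLevel(s string) Level {
+	switch strings.ToUpper(strings.TrimSpace(s)) {
+	case "DEBUG":
+		return LevelDebug
+	case "WARN":
+		return LevelWarn
+	case "ERROR":
+		return LevelError
+	default:
+		return LevelInfo
+	}
+}
+
+// useJSON reports whether the format string selects JSON output.
+func useJSON(s string) bool {
+	return strings.EqualFold(strings.TrimSpace(s), "json")
+}
+
+func main() {
+	level := parseLevel(os.Getenv("ATLAS_LOG_LEVEL"))
+	jsonMode := useJSON(os.Getenv("ATLAS_LOG_FORMAT"))
+	fmt.Printf("level=%d json=%t\n", level, jsonMode)
+}
+```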
+
+### Usage
+
+```go
+import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"
+
+// Simple logging
+logger.Info("User logged in")
+logger.Error("Failed to create pool", err)
+
+// With fields
+logger.Info("Pool created", map[string]interface{}{
+    "pool": "tank",
+    "size": "10TB",
+})
+```
+
+### Log Levels
+
+- **DEBUG**: Detailed information for debugging
+- **INFO**: General informational messages
+- **WARN**: Warning messages for potential issues
+- **ERROR**: Error messages for failures
+
+## Request Logging
+
+### Access Logs
+
+All HTTP requests are logged with:
+
+- **Timestamp**: Request time
+- **Method**: HTTP method (GET, POST, etc.)
+- **Path**: Request path
+- **Status**: HTTP status code
+- **Duration**: Request processing time
+- **Request ID**: Unique request identifier
+- **Remote Address**: Client IP address
+
+**Example Log Entry:**
+```
+2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms
+```
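+
+A minimal sketch of middleware that could produce the message part of such an entry is shown below. It uses the standard `log` package instead of the project logger, and `statusRecorder`, `accessLine`, `clientIP`, and `accessLog` are hypothetical names; the actual AtlasOS middleware may be structured differently.
+
+```go
+package main
+
+import (
+	"fmt"
+	"log"
+	"net"
+	"net/http"
+	"net/http/httptest"
+	"time"
+)
+
+// statusRecorder captures the status code written by the wrapped handler.
+type statusRecorder struct {
+	http.ResponseWriter
+	status int
+}
+
+func (r *statusRecorder) WriteHeader(code int) {
+	r.status = code
+	r.ResponseWriter.WriteHeader(code)
+}
+
+// clientIP strips the port from a "host:port" remote address.
+func clientIP(remoteAddr string) string {
+	host, _, err := net.SplitHostPort(remoteAddr)
+	if err != nil {
+		return remoteAddr
+	}
+	return host
+}
+
+// accessLine formats the message part of an access log entry.
+func accessLine(remote, method, path string, status int, rid string, dur time.Duration) string {
+	return fmt.Sprintf("%s %s %s status=%d rid=%s dur=%s", remote, method, path, status, rid, dur)
+}
+
+// accessLog wraps next and logs one line per request after it completes.
+func accessLog(next http.Handler) http.Handler {
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		start := time.Now()
+		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
+		next.ServeHTTP(rec, r)
+		log.Print(accessLine(clientIP(r.RemoteAddr), r.Method, r.URL.Path,
+			rec.status, r.Header.Get("X-Request-Id"), time.Since(start)))
+	})
+}
+
+func main() {
+	// Exercise the middleware in-process with a synthetic request.
+	handler := accessLog(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
+		w.WriteHeader(http.StatusOK)
+	}))
+	req := httptest.NewRequest(http.MethodGet, "/api/v1/pools", nil)
+	req.Header.Set("X-Request-Id", "abc123")
+	handler.ServeHTTP(httptest.NewRecorder(), req)
+}
+```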
+
+### Request ID
+
+Every request gets a unique request ID:
+
+- **Header**: `X-Request-Id`
+- **Usage**: Track requests across services
+- **Format**: 32-character hex string
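+
+One common way to obtain a 32-character hex string is to hex-encode 16 random bytes. The sketch below shows that approach; `newRequestID` is a hypothetical name, and the source does not confirm that AtlasOS generates its IDs this way.
+
+```go
+package main
+
+import (
+	"crypto/rand"
+	"encoding/hex"
+	"fmt"
+)
+
+// newRequestID returns 16 random bytes encoded as 32 hex characters.
+func newRequestID() (string, error) {
+	var b [16]byte
+	if _, err := rand.Read(b[:]); err != nil {
+		return "", err
+	}
+	return hex.EncodeToString(b[:]), nil
+}
+
+func main() {
+	id, err := newRequestID()
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(id, len(id))
+}
+```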
+
+## Diagnostic Endpoints
+
+### System Information
+
+**GET** `/api/v1/system/info`
+
+Returns version, uptime, Go runtime and memory statistics, service status, and database status:
+
+```json
+{
+  "version": "v0.1.0-dev",
+  "uptime": "3600 seconds",
+  "go_version": "go1.21.0",
+  "num_goroutines": 15,
+  "memory": {
+    "alloc": 1048576,
+    "total_alloc": 52428800,
+    "sys": 2097152,
+    "num_gc": 5
+  },
+  "services": {
+    "smb": {
+      "status": "running",
+      "last_check": "2024-12-20T10:30:56Z"
+    },
+    "nfs": {
+      "status": "running",
+      "last_check": "2024-12-20T10:30:56Z"
+    },
+    "iscsi": {
+      "status": "stopped",
+      "last_check": "2024-12-20T10:30:56Z"
+    }
+  },
+  "database": {
+    "connected": true,
+    "path": "/var/lib/atlas/atlas.db"
+  }
+}
+```
+
+### Health Check
+
+**GET** `/health`
+
+Detailed health check with component status:
+
+```json
+{
+  "status": "healthy",
+  "timestamp": "2024-12-20T10:30:56Z",
+  "checks": {
+    "zfs": "healthy",
+    "database": "healthy",
+    "smb": "healthy",
+    "nfs": "healthy",
+    "iscsi": "stopped"
+  }
+}
+```
+
+**Status Values:**
+- `healthy`: Component is working correctly
+- `degraded`: Some components have issues, but the system is operational
+- `unhealthy`: Critical components are failing
+
+**HTTP Status Codes:**
+- `200 OK`: System is healthy or degraded
+- `503 Service Unavailable`: System is unhealthy
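+
+A monitoring client might decode the response body and list the checks that are not `healthy`. The sketch below assumes the JSON shape shown above; `healthResponse` and `notHealthy` are hypothetical names, not part of the AtlasOS codebase.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"sort"
+)
+
+// healthResponse mirrors the JSON shape of the /health example.
+type healthResponse struct {
+	Status    string            `json:"status"`
+	Timestamp string            `json:"timestamp"`
+	Checks    map[string]string `json:"checks"`
+}
+
+// notHealthy returns the overall status and the sorted names of
+// checks whose value is anything other than "healthy".
+func notHealthy(body []byte) (string, []string, error) {
+	var h healthResponse
+	if err := json.Unmarshal(body, &h); err != nil {
+		return "", nil, err
+	}
+	var names []string
+	for name, state := range h.Checks {
+		if state != "healthy" {
+			names = append(names, name)
+		}
+	}
+	sort.Strings(names)
+	return h.Status, names, nil
+}
+
+func main() {
+	body := []byte(`{"status":"healthy","timestamp":"2024-12-20T10:30:56Z",
+		"checks":{"zfs":"healthy","database":"healthy","smb":"healthy","nfs":"healthy","iscsi":"stopped"}}`)
+	status, names, err := notHealthy(body)
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(status, names)
+}
+```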
+
+### System Logs
+
+**GET** `/api/v1/system/logs?limit=100`
+
+Returns recent system logs (from audit logs):
+
+```json
+{
+  "logs": [
+    {
+      "timestamp": "2024-12-20T10:30:56Z",
+      "level": "INFO",
+      "actor": "user-1",
+      "action": "pool.create",
+      "resource": "pool:tank",
+      "result": "success",
+      "ip": "192.168.1.100"
+    }
+  ],
+  "count": 1
+}
+```
+
+**Query Parameters:**
+- `limit`: Maximum number of logs to return (default: 100, max: 1000)
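+
+The default and maximum can be enforced with a small clamping helper. The sketch below is illustrative: `parseLimit` is a hypothetical name, and falling back to the default for missing or invalid input is an assumption (a handler could instead reject such requests with `400 Bad Request`).
+
+```go
+package main
+
+import (
+	"fmt"
+	"strconv"
+)
+
+const (
+	defaultLimit = 100
+	maxLimit     = 1000
+)
+
+// parseLimit converts the raw query value to a limit in [1, maxLimit].
+// Missing, non-numeric, or non-positive values fall back to defaultLimit
+// (an assumption about the desired behavior).
+func parseLimit(raw string) int {
+	n, err := strconv.Atoi(raw)
+	if err != nil || n <= 0 {
+		return defaultLimit
+	}
+	if n > maxLimit {
+		return maxLimit
+	}
+	return n
+}
+
+func main() {
+	for _, raw := range []string{"", "50", "5000", "abc"} {
+		fmt.Printf("%q -> %d\n", raw, parseLimit(raw))
+	}
+}
+```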
+
+### Garbage Collection
+
+**POST** `/api/v1/system/gc`
+
+Triggers garbage collection and returns memory statistics:
+
+```json
+{
+  "before": {
+    "alloc": 1048576,
+    "total_alloc": 52428800,
+    "sys": 2097152,
+    "num_gc": 5
+  },
+  "after": {
+    "alloc": 512000,
+    "total_alloc": 52428800,
+    "sys": 2097152,
+    "num_gc": 6
+  },
+  "freed": 536576
+}
+```
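+
+The fields in this response correspond to fields of Go's `runtime.MemStats`. A minimal sketch of how the before/after snapshots and the `freed` value could be computed follows; `memSnapshot`, `snapshot`, and `freedBytes` are hypothetical names, and reporting 0 when the heap grew between snapshots is an assumption.
+
+```go
+package main
+
+import (
+	"fmt"
+	"runtime"
+)
+
+// memSnapshot holds the subset of runtime.MemStats shown in the response.
+type memSnapshot struct {
+	Alloc      uint64 `json:"alloc"`
+	TotalAlloc uint64 `json:"total_alloc"`
+	Sys        uint64 `json:"sys"`
+	NumGC      uint32 `json:"num_gc"`
+}
+
+// snapshot reads the current memory statistics from the Go runtime.
+func snapshot() memSnapshot {
+	var m runtime.MemStats
+	runtime.ReadMemStats(&m)
+	return memSnapshot{Alloc: m.Alloc, TotalAlloc: m.TotalAlloc, Sys: m.Sys, NumGC: m.NumGC}
+}
+
+// freedBytes returns before.Alloc - after.Alloc, or 0 if the heap grew
+// between the two snapshots (avoids unsigned underflow).
+func freedBytes(before, after memSnapshot) uint64 {
+	if after.Alloc > before.Alloc {
+		return 0
+	}
+	return before.Alloc - after.Alloc
+}
+
+func main() {
+	before := snapshot()
+	runtime.GC() // blocks until the collection completes
+	after := snapshot()
+	fmt.Printf("num_gc: %d -> %d, freed: %d bytes\n",
+		before.NumGC, after.NumGC, freedBytes(before, after))
+}
+```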
+
+## Audit Logging
+
+Audit logs track all mutating operations:
+
+- **Actor**: User ID or "system"
+- **Action**: Operation type (e.g., "pool.create")
+- **Resource**: Resource identifier
+- **Result**: "success" or "failure"
+- **IP**: Client IP address
+- **User Agent**: Client user agent
+- **Timestamp**: Operation time
+
+See [Audit Logging Documentation](./AUDIT_LOGGING.md) for details.
+
+## Log Rotation
+
+### Current Implementation
+
+- **In-Memory**: Audit logs stored in memory
+- **Rotation**: Automatic rotation when the maximum number of logs is reached
+- **Limit**: Configurable (default: 10,000 logs)
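+
+One way to bound an in-memory log is to drop the oldest entries once the limit is exceeded. The sketch below illustrates that idea with a mutex-guarded slice; `boundedLog` is a hypothetical type, and discarding oldest-first is an assumption about what "rotation" means here.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+)
+
+// boundedLog keeps at most max entries, discarding the oldest first.
+type boundedLog struct {
+	mu      sync.Mutex
+	max     int
+	entries []string
+}
+
+// Add appends an entry and trims the log back to max entries.
+func (b *boundedLog) Add(entry string) {
+	b.mu.Lock()
+	defer b.mu.Unlock()
+	b.entries = append(b.entries, entry)
+	if over := len(b.entries) - b.max; over > 0 {
+		// Copy into a fresh slice so dropped entries can be garbage collected.
+		b.entries = append([]string(nil), b.entries[over:]...)
+	}
+}
+
+// Snapshot returns a copy of the current entries, oldest first.
+func (b *boundedLog) Snapshot() []string {
+	b.mu.Lock()
+	defer b.mu.Unlock()
+	return append([]string(nil), b.entries...)
+}
+
+func main() {
+	store := &boundedLog{max: 3}
+	for i := 1; i <= 5; i++ {
+		store.Add(fmt.Sprintf("event-%d", i))
+	}
+	fmt.Println(store.Snapshot())
+}
+```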
+
+### Future Enhancements
+
+- **File Logging**: Write logs to files
+- **Automatic Rotation**: Rotate log files by size/age
+- **Compression**: Compress old log files
+- **Retention**: Configurable retention policies
+
+## Best Practices
+
+### 1. Use Appropriate Log Levels
+
+```go
+// Debug - detailed information
+logger.Debug("Processing request", map[string]interface{}{
+    "request_id": reqID,
+    "user": userID,
+})
+
+// Info - important events
+logger.Info("User logged in", map[string]interface{}{
+    "user": userID,
+})
+
+// Warn - potential issues
+logger.Warn("High memory usage", map[string]interface{}{
+    "usage": "85%",
+})
+
+// Error - failures
+logger.Error("Failed to create pool", err, map[string]interface{}{
+    "pool": poolName,
+})
+```
+
+### 2. Include Context
+
+Always include relevant context in logs:
+
+```go
+// Good
+logger.Info("Pool created", map[string]interface{}{
+    "pool": poolName,
+    "size": poolSize,
+    "user": userID,
+})
+
+// Avoid
+logger.Info("Pool created")
+```
+
+### 3. Use Request IDs
+
+Include request IDs in logs for tracing:
+
+```go
+// The comma-ok form avoids a panic when the context carries no request ID
+reqID, _ := r.Context().Value(requestIDKey).(string)
+logger.Info("Processing request", map[string]interface{}{
+    "request_id": reqID,
+})
+```
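+
+The snippet above relies on a `requestIDKey` context key that is defined elsewhere. A minimal sketch of how such a key might be declared, stored, and read is shown below; `ctxKey`, `withRequestID`, and `requestIDFrom` are hypothetical names, and the real definition in AtlasOS may differ.
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+)
+
+// ctxKey is an unexported type so the key cannot collide with keys
+// defined by other packages.
+type ctxKey struct{}
+
+// requestIDKey is a hypothetical definition of the key used above.
+var requestIDKey = ctxKey{}
+
+// withRequestID returns a copy of ctx that carries the request ID.
+func withRequestID(ctx context.Context, id string) context.Context {
+	return context.WithValue(ctx, requestIDKey, id)
+}
+
+// requestIDFrom returns the request ID, or "" if none is set.
+func requestIDFrom(ctx context.Context) string {
+	id, _ := ctx.Value(requestIDKey).(string)
+	return id
+}
+
+func main() {
+	ctx := withRequestID(context.Background(), "abc123")
+	fmt.Println(requestIDFrom(ctx))
+	fmt.Println(requestIDFrom(context.Background()) == "")
+}
+```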
+
+### 4. Monitor Health Endpoints
+
+Regularly check health endpoints:
+
+```bash
+# Simple health check
+curl http://localhost:8080/healthz
+
+# Detailed health check
+curl http://localhost:8080/health
+
+# System information
+curl http://localhost:8080/api/v1/system/info
+```
+
+## Monitoring
+
+### Key Metrics
+
+Monitor these metrics for system health:
+
+- **Request Duration**: Track in access logs
+- **Error Rate**: Count of error responses
+- **Memory Usage**: Check via `/api/v1/system/info`
+- **Goroutine Count**: Monitor for leaks
+- **Service Status**: Check service health
+
+### Alerting
+
+Set up alerts for:
+
+- **Unhealthy Status**: System health check fails
+- **High Error Rate**: Too many error responses
+- **Memory Leaks**: Continuously increasing memory
+- **Service Failures**: Services not running
+
+## Troubleshooting
+
+### Check System Health
+
+```bash
+curl http://localhost:8080/health
+```
+
+### View System Information
+
+```bash
+curl http://localhost:8080/api/v1/system/info
+```
+
+### Check Recent Logs
+
+```bash
+curl "http://localhost:8080/api/v1/system/logs?limit=50"
+```
+
+### Trigger GC
+
+```bash
+curl -X POST http://localhost:8080/api/v1/system/gc
+```
+
+### View Request Logs
+
+Check application logs for request details:
+
+```bash
+# If logging to stdout
+./atlas-api | grep "GET /api/v1/pools"
+
+# If logging to file
+tail -f /var/log/atlas-api.log | grep "status=500"
+```
+
+## Future Enhancements
+
+1. **File Logging**: Write logs to files with rotation
+2. **Log Aggregation**: Support for centralized logging (ELK, Loki)
+3. **Structured Logging**: Full JSON logging support
+4. **Log Levels per Component**: Different levels for different components
+5. **Performance Logging**: Detailed performance metrics
+6. **Distributed Tracing**: Request tracing across services
+7. **Log Filtering**: Filter logs by level, component, etc.
+8. **Real-time Log Streaming**: Stream logs via WebSocket