atlas/docs/LOGGING_DIAGNOSTICS.md
othman.suseno df475bc85e
logging and diagnostic features added
2025-12-15 00:45:14 +07:00

# Logging & Diagnostics
## Overview
AtlasOS provides comprehensive logging and diagnostic capabilities to help monitor system health, troubleshoot issues, and understand system behavior.
## Structured Logging
### Logger Package
The `internal/logger` package provides structured logging with:
- **Log Levels**: DEBUG, INFO, WARN, ERROR
- **JSON Mode**: Optional JSON-formatted output
- **Structured Fields**: Key-value pairs for context
- **Thread-Safe**: Safe for concurrent use
### Configuration
Configure logging via environment variables:
```bash
# Log level (DEBUG, INFO, WARN, ERROR)
export ATLAS_LOG_LEVEL=INFO
# Log format (json or text)
export ATLAS_LOG_FORMAT=json
```
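As a rough sketch of how these two variables might be interpreted at startup: the `Level` type, `parseLevel`, and `useJSON` below are hypothetical names, not the actual `internal/logger` API, and the fallback to INFO for unset or unknown values is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Level is a hypothetical log-level type; the real internal/logger type may differ.
type Level int

const (
	LevelDebug Level = iota
	LevelInfo
	LevelWarn
	LevelError
)

// parseLevel maps a level name to a Level. Empty or unrecognized
// values fall back to INFO (this fallback is an assumption).
func parseLevel(s string) Level {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "DEBUG":
		return LevelDebug
	case "WARN":
		return LevelWarn
	case "ERROR":
		return LevelError
	default:
		return LevelInfo
	}
}

// useJSON reports whether the format value selects JSON output;
// anything other than "json" is treated as text.
func useJSON(s string) bool {
	return strings.EqualFold(strings.TrimSpace(s), "json")
}

func main() {
	level := parseLevel(os.Getenv("ATLAS_LOG_LEVEL"))
	jsonMode := useJSON(os.Getenv("ATLAS_LOG_FORMAT"))
	fmt.Printf("level=%d json=%v\n", level, jsonMode)
}
```

Normalizing case and whitespace keeps values such as `info` or `JSON` from silently selecting the wrong mode.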
### Usage
```go
import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"

// Simple logging
logger.Info("User logged in")
logger.Error("Failed to create pool", err)

// With fields
logger.Info("Pool created", map[string]interface{}{
	"pool": "tank",
	"size": "10TB",
})
```
### Log Levels
- **DEBUG**: Detailed information for debugging
- **INFO**: General informational messages
- **WARN**: Warning messages for potential issues
- **ERROR**: Error messages for failures
## Request Logging
### Access Logs
All HTTP requests are logged with:
- **Timestamp**: Request time
- **Method**: HTTP method (GET, POST, etc.)
- **Path**: Request path
- **Status**: HTTP status code
- **Duration**: Request processing time
- **Request ID**: Unique request identifier
- **Remote Address**: Client IP address
**Example Log Entry:**
```
2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms
```
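One way such an access log line could be produced is with a small `net/http` middleware. This is a minimal sketch, not the AtlasOS implementation: `statusRecorder`, `formatAccessLog`, `exampleEntry`, and `accessLog` are hypothetical names, and the request ID is assumed to be available in the `X-Request-Id` request header.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// formatAccessLog renders one line in the layout of the example entry.
func formatAccessLog(ts time.Time, remote, method, path string, status int, rid string, dur time.Duration) string {
	return fmt.Sprintf("%s [INFO] %s %s %s status=%d rid=%s dur=%s",
		ts.UTC().Format(time.RFC3339), remote, method, path, status, rid, dur)
}

// exampleEntry rebuilds the documented example entry from its field values.
func exampleEntry() string {
	ts := time.Date(2024, time.December, 20, 10, 30, 56, 0, time.UTC)
	return formatAccessLog(ts, "192.168.1.100", "GET", "/api/v1/pools", 200, "abc123", 45*time.Millisecond)
}

// accessLog wraps next and passes one formatted line per request to emit.
// Note that r.RemoteAddr is "ip:port"; stripping the port is left out here.
func accessLog(next http.Handler, emit func(string)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		emit(formatAccessLog(start, r.RemoteAddr, r.Method, r.URL.Path,
			rec.status, r.Header.Get("X-Request-Id"), time.Since(start)))
	})
}

func main() {
	fmt.Println(exampleEntry())

	h := accessLog(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}), func(line string) { fmt.Println(line) })

	req := httptest.NewRequest(http.MethodGet, "/api/v1/pools", nil)
	req.Header.Set("X-Request-Id", "abc123")
	h.ServeHTTP(httptest.NewRecorder(), req)
}
```

Wrapping the `ResponseWriter` is needed because `net/http` does not otherwise expose the status code a handler wrote.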
### Request ID
Every request gets a unique request ID:
- **Header**: `X-Request-Id`
- **Usage**: Track requests across services
- **Format**: 32-character hex string
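A 32-character hex string corresponds to 16 random bytes. The sketch below shows one plausible way to generate and attach such an ID; `newRequestID` and `requestID` are hypothetical names, and setting the header on both the request and the response is an assumption about how the ID is propagated.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newRequestID returns 16 random bytes encoded as 32 hex characters.
func newRequestID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// requestID assigns a fresh ID to every request and exposes it through
// the X-Request-Id header (on request and response; an assumption).
func requestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id, err := newRequestID()
		if err != nil {
			http.Error(w, "request ID generation failed", http.StatusInternalServerError)
			return
		}
		r.Header.Set("X-Request-Id", id)
		w.Header().Set("X-Request-Id", id)
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := requestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println("X-Request-Id:", rec.Header().Get("X-Request-Id"))
}
```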
## Diagnostic Endpoints
### System Information
**GET** `/api/v1/system/info`
Returns comprehensive system information:
```json
{
  "version": "v0.1.0-dev",
  "uptime": "3600 seconds",
  "go_version": "go1.21.0",
  "num_goroutines": 15,
  "memory": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "services": {
    "smb": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "nfs": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "iscsi": {
      "status": "stopped",
      "last_check": "2024-12-20T10:30:56Z"
    }
  },
  "database": {
    "connected": true,
    "path": "/var/lib/atlas/atlas.db"
  }
}
```
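The `go_version`, `num_goroutines`, and `memory` fields map naturally onto Go's `runtime` package. The following sketch shows how that part of the response could be collected; the struct and function names are hypothetical, and the real handler may gather these values differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

// memoryInfo mirrors the "memory" object in the response above.
type memoryInfo struct {
	Alloc      uint64 `json:"alloc"`
	TotalAlloc uint64 `json:"total_alloc"`
	Sys        uint64 `json:"sys"`
	NumGC      uint32 `json:"num_gc"`
}

// runtimeInfo covers only the runtime-derived fields of the response.
type runtimeInfo struct {
	GoVersion     string     `json:"go_version"`
	NumGoroutines int        `json:"num_goroutines"`
	Memory        memoryInfo `json:"memory"`
}

// collectRuntimeInfo reads the current runtime statistics.
func collectRuntimeInfo() runtimeInfo {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return runtimeInfo{
		GoVersion:     runtime.Version(),
		NumGoroutines: runtime.NumGoroutine(),
		Memory: memoryInfo{
			Alloc:      m.Alloc,
			TotalAlloc: m.TotalAlloc,
			Sys:        m.Sys,
			NumGC:      m.NumGC,
		},
	}
}

func main() {
	out, err := json.MarshalIndent(collectRuntimeInfo(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

All memory values are in bytes, as reported by `runtime.MemStats`.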
### Health Check
**GET** `/health`
Detailed health check with component status:
```json
{
  "status": "healthy",
  "timestamp": "2024-12-20T10:30:56Z",
  "checks": {
    "zfs": "healthy",
    "database": "healthy",
    "smb": "healthy",
    "nfs": "healthy",
    "iscsi": "stopped"
  }
}
```
**Status Values:**
- `healthy`: Component is working correctly
- `degraded`: Some components have issues but system is operational
- `unhealthy`: Critical components are failing
**HTTP Status Codes:**
- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy
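The mapping from overall status to HTTP status code can be sketched as follows. `healthReport`, `httpStatusFor`, and `writeHealth` are hypothetical names, and how the overall status is derived from the individual checks is not covered here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthReport is a reduced form of the health response shown above.
type healthReport struct {
	Status string            `json:"status"`
	Checks map[string]string `json:"checks"`
}

// httpStatusFor returns 503 for "unhealthy" and 200 otherwise,
// so "healthy" and "degraded" both report 200 OK.
func httpStatusFor(status string) int {
	if status == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// writeHealth sends the report with the matching status code.
func writeHealth(w http.ResponseWriter, rep healthReport) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(httpStatusFor(rep.Status))
	_ = json.NewEncoder(w).Encode(rep)
}

func main() {
	rec := httptest.NewRecorder()
	// Illustrative values only; not taken from a real system.
	writeHealth(rec, healthReport{
		Status: "unhealthy",
		Checks: map[string]string{"zfs": "unhealthy"},
	})
	fmt.Println(rec.Code)
}
```

Returning 200 for `degraded` lets load balancers keep routing traffic while monitoring tools inspect the body for details.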
### System Logs
**GET** `/api/v1/system/logs?limit=100`
Returns recent system logs (from audit logs):
```json
{
  "logs": [
    {
      "timestamp": "2024-12-20T10:30:56Z",
      "level": "INFO",
      "actor": "user-1",
      "action": "pool.create",
      "resource": "pool:tank",
      "result": "success",
      "ip": "192.168.1.100"
    }
  ],
  "count": 1
}
```
**Query Parameters:**
- `limit`: Maximum number of logs to return (default: 100, max: 1000)
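A handler might apply the default and maximum like this. It is a sketch under assumptions: `parseLimit` is a hypothetical helper, and treating missing, non-numeric, or non-positive values as the default is a guess at the intended behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	defaultLimit = 100  // documented default
	maxLimit     = 1000 // documented maximum
)

// parseLimit converts the raw "limit" query value into a bounded integer.
// Missing, non-numeric, or non-positive values fall back to the default
// (an assumption); values above the maximum are clamped.
func parseLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultLimit
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}

func main() {
	for _, raw := range []string{"", "50", "5000", "abc"} {
		fmt.Printf("limit=%q -> %d\n", raw, parseLimit(raw))
	}
}
```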
### Garbage Collection
**POST** `/api/v1/system/gc`
Triggers garbage collection and returns memory statistics:
```json
{
  "before": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "after": {
    "alloc": 512000,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 6
  },
  "freed": 536576
}
```
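In the example above, `freed` is `before.alloc - after.alloc` (1048576 - 512000 = 536576). A minimal sketch of how such a before/after snapshot could be taken with the `runtime` package follows; the type and function names are hypothetical, and clamping `freed` to zero when allocation grows is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// memStats mirrors the "before" and "after" objects in the response.
type memStats struct {
	Alloc      uint64
	TotalAlloc uint64
	Sys        uint64
	NumGC      uint32
}

// gcResult holds both snapshots and the difference in allocated bytes.
type gcResult struct {
	Before, After memStats
	Freed         uint64
}

func snapshot() memStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memStats{Alloc: m.Alloc, TotalAlloc: m.TotalAlloc, Sys: m.Sys, NumGC: m.NumGC}
}

// freedBytes returns before-after, or 0 if allocation grew in between
// (unsigned subtraction would otherwise wrap around).
func freedBytes(before, after uint64) uint64 {
	if after > before {
		return 0
	}
	return before - after
}

// runGC forces a collection and reports memory statistics around it.
func runGC() gcResult {
	before := snapshot()
	runtime.GC() // blocks until the collection completes
	after := snapshot()
	return gcResult{Before: before, After: after, Freed: freedBytes(before.Alloc, after.Alloc)}
}

func main() {
	res := runGC()
	fmt.Printf("alloc before=%d after=%d freed=%d num_gc=%d\n",
		res.Before.Alloc, res.After.Alloc, res.Freed, res.After.NumGC)
}
```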
## Audit Logging
Audit logs track all mutating operations:
- **Actor**: User ID or "system"
- **Action**: Operation type (e.g., "pool.create")
- **Resource**: Resource identifier
- **Result**: "success" or "failure"
- **IP**: Client IP address
- **User Agent**: Client user agent
- **Timestamp**: Operation time
See [Audit Logging Documentation](./AUDIT_LOGGING.md) for details.
## Log Rotation
### Current Implementation
- **In-Memory**: Audit logs stored in memory
- **Rotation**: Automatic rotation once the maximum number of logs is reached
- **Limit**: Configurable (default: 10,000 logs)
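A bounded in-memory store of this kind could look like the sketch below. It assumes that "rotation" means discarding the oldest entries once the limit is exceeded; `auditStore` and its methods are hypothetical names, and entries are plain strings for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// auditStore keeps at most max entries in memory, oldest first.
type auditStore struct {
	mu      sync.Mutex
	max     int
	entries []string
}

func newAuditStore(max int) *auditStore {
	return &auditStore{max: max}
}

// add appends an entry and drops the oldest ones beyond the limit
// (dropping the oldest is an assumption about the rotation policy).
func (s *auditStore) add(e string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries = append(s.entries, e)
	if over := len(s.entries) - s.max; over > 0 {
		s.entries = s.entries[over:]
	}
}

// recent returns a copy of up to limit of the newest entries, oldest first.
func (s *auditStore) recent(limit int) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if limit <= 0 || limit > len(s.entries) {
		limit = len(s.entries)
	}
	out := make([]string, limit)
	copy(out, s.entries[len(s.entries)-limit:])
	return out
}

func main() {
	store := newAuditStore(3) // the documented default is 10,000
	for _, action := range []string{"pool.create", "pool.destroy", "share.create", "share.delete"} {
		store.add(action)
	}
	fmt.Println(store.recent(10))
}
```

Returning a copy from `recent` keeps callers from observing later mutations of the internal slice.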
### Future Enhancements
- **File Logging**: Write logs to files
- **Automatic Rotation**: Rotate log files by size/age
- **Compression**: Compress old log files
- **Retention**: Configurable retention policies
## Best Practices
### 1. Use Appropriate Log Levels
```go
// Debug - detailed information
logger.Debug("Processing request", map[string]interface{}{
	"request_id": reqID,
	"user":       userID,
})

// Info - important events
logger.Info("User logged in", map[string]interface{}{
	"user": userID,
})

// Warn - potential issues
logger.Warn("High memory usage", map[string]interface{}{
	"usage": "85%",
})

// Error - failures
logger.Error("Failed to create pool", err, map[string]interface{}{
	"pool": poolName,
})
```
### 2. Include Context
Always include relevant context in logs:
```go
// Good
logger.Info("Pool created", map[string]interface{}{
	"pool": poolName,
	"size": poolSize,
	"user": userID,
})

// Avoid
logger.Info("Pool created")
```
### 3. Use Request IDs
Include request IDs in logs for tracing:
```go
// Use the comma-ok form so a missing request ID does not panic
reqID, _ := r.Context().Value(requestIDKey).(string)
logger.Info("Processing request", map[string]interface{}{
	"request_id": reqID,
})
```
### 4. Monitor Health Endpoints
Regularly check health endpoints:
```bash
# Simple health check
curl http://localhost:8080/healthz
# Detailed health check
curl http://localhost:8080/health
# System information
curl http://localhost:8080/api/v1/system/info
```
## Monitoring
### Key Metrics
Monitor these metrics for system health:
- **Request Duration**: Track in access logs
- **Error Rate**: Count of error responses
- **Memory Usage**: Check via `/api/v1/system/info`
- **Goroutine Count**: Monitor for leaks
- **Service Status**: Check service health
### Alerting
Set up alerts for:
- **Unhealthy Status**: System health check fails
- **High Error Rate**: Too many error responses
- **Memory Leaks**: Continuously increasing memory
- **Service Failures**: Services not running
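Two of these conditions can be expressed as simple checks over collected samples. The sketch below is illustrative only: `errorRate` and `steadilyGrowing` are hypothetical helpers, counting only 5xx responses as errors is an assumption, and the 10% threshold in `main` is an arbitrary example value.

```go
package main

import "fmt"

// errorRate returns the fraction of responses with a 5xx status code.
// Treating only 5xx as errors is an assumption.
func errorRate(statuses []int) float64 {
	if len(statuses) == 0 {
		return 0
	}
	errs := 0
	for _, s := range statuses {
		if s >= 500 {
			errs++
		}
	}
	return float64(errs) / float64(len(statuses))
}

// steadilyGrowing reports whether every sample is larger than the one
// before it, a crude signal of continuously increasing memory.
func steadilyGrowing(samples []uint64) bool {
	if len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		if samples[i] <= samples[i-1] {
			return false
		}
	}
	return true
}

func main() {
	statuses := []int{200, 200, 500, 503}
	if rate := errorRate(statuses); rate > 0.10 { // example threshold
		fmt.Printf("ALERT: error rate %.0f%%\n", rate*100)
	}

	allocSamples := []uint64{1048576, 2097152, 4194304}
	if steadilyGrowing(allocSamples) {
		fmt.Println("ALERT: memory usage keeps increasing")
	}
}
```

The status codes would typically come from access logs and the memory samples from repeated calls to `/api/v1/system/info`.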
## Troubleshooting
### Check System Health
```bash
curl http://localhost:8080/health
```
### View System Information
```bash
curl http://localhost:8080/api/v1/system/info
```
### Check Recent Logs
```bash
# Quote the URL so the shell does not interpret "?"
curl "http://localhost:8080/api/v1/system/logs?limit=50"
```
### Trigger GC
```bash
curl -X POST http://localhost:8080/api/v1/system/gc
```
### View Request Logs
Check application logs for request details:
```bash
# If logging to stdout
./atlas-api | grep "GET /api/v1/pools"
# If logging to file
tail -f /var/log/atlas-api.log | grep "status=500"
```
## Future Enhancements
1. **File Logging**: Write logs to files with rotation
2. **Log Aggregation**: Support for centralized logging (ELK, Loki)
3. **Structured Logging**: Full JSON logging support
4. **Log Levels per Component**: Different levels for different components
5. **Performance Logging**: Detailed performance metrics
6. **Distributed Tracing**: Request tracing across services
7. **Log Filtering**: Filter logs by level, component, etc.
8. **Real-time Log Streaming**: Stream logs via WebSocket