docs/LOGGING_DIAGNOSTICS.md
Normal file
@@ -0,0 +1,366 @@
# Logging & Diagnostics

## Overview

AtlasOS provides comprehensive logging and diagnostic capabilities to help monitor system health, troubleshoot issues, and understand system behavior.

## Structured Logging

### Logger Package

The `internal/logger` package provides structured logging with:

- **Log Levels**: DEBUG, INFO, WARN, ERROR
- **JSON Mode**: Optional JSON-formatted output
- **Structured Fields**: Key-value pairs for context
- **Thread-Safe**: Safe for concurrent use

### Configuration

Configure logging via environment variables:

```bash
# Log level (DEBUG, INFO, WARN, ERROR)
export ATLAS_LOG_LEVEL=INFO

# Log format (json or text)
export ATLAS_LOG_FORMAT=json
```
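
As a rough illustration of how these two variables might be read at startup, the sketch below maps them onto a level and a format flag. The `Level` type, `parseLevel`, and `useJSON` are hypothetical names, not the actual `internal/logger` API, and falling back to INFO and text output for unset or unknown values is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Level is a hypothetical severity type; the real internal/logger package may differ.
type Level int

const (
	LevelDebug Level = iota
	LevelInfo
	LevelWarn
	LevelError
)

// parseLevel maps a level name to a Level. Empty or unknown names fall
// back to INFO, which is an assumed default.
func parseLevel(s string) Level {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "DEBUG":
		return LevelDebug
	case "WARN":
		return LevelWarn
	case "ERROR":
		return LevelError
	default:
		return LevelInfo
	}
}

// useJSON reports whether the format value selects JSON output.
func useJSON(s string) bool {
	return strings.EqualFold(strings.TrimSpace(s), "json")
}

func main() {
	level := parseLevel(os.Getenv("ATLAS_LOG_LEVEL"))
	jsonMode := useJSON(os.Getenv("ATLAS_LOG_FORMAT"))
	fmt.Printf("level=%d json=%t\n", level, jsonMode)
}
```

Matching case-insensitively keeps `ATLAS_LOG_LEVEL=info` and `ATLAS_LOG_LEVEL=INFO` equivalent, which is a design choice of this sketch rather than documented behavior.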

### Usage

```go
import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"

// Simple logging
logger.Info("User logged in")
logger.Error("Failed to create pool", err)

// With fields
logger.Info("Pool created", map[string]interface{}{
	"pool": "tank",
	"size": "10TB",
})
```

### Log Levels

- **DEBUG**: Detailed information for debugging
- **INFO**: General informational messages
- **WARN**: Warning messages for potential issues
- **ERROR**: Error messages for failures
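
To make level filtering, JSON output, and thread safety concrete, here is a minimal sketch of a leveled logger. It is not the `internal/logger` implementation: the `Logger` type, its `Log` method, and the `render` helper are hypothetical, and the `level` and `msg` field names in the output are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"sync"
)

// Level orders severities so that entries below a minimum can be dropped.
type Level int

const (
	Debug Level = iota
	Info
	Warn
	Error
)

var levelNames = [...]string{"DEBUG", "INFO", "WARN", "ERROR"}

// Logger is a hypothetical minimal logger, not the internal/logger API.
type Logger struct {
	mu  sync.Mutex // serializes writes so concurrent callers do not interleave lines
	out io.Writer
	min Level
}

// Log writes one JSON object per line and drops entries below the minimum level.
// Caller-supplied fields named "level" or "msg" would overwrite the built-in ones.
func (l *Logger) Log(level Level, msg string, fields map[string]interface{}) error {
	if level < l.min {
		return nil
	}
	entry := map[string]interface{}{"level": levelNames[level], "msg": msg}
	for k, v := range fields {
		entry[k] = v
	}
	line, err := json.Marshal(entry)
	if err != nil {
		return err
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	_, err = l.out.Write(append(line, '\n'))
	return err
}

// render logs a single entry into a buffer and returns what was written.
func render(min, level Level, msg string, fields map[string]interface{}) string {
	var buf bytes.Buffer
	l := &Logger{out: &buf, min: min}
	if err := l.Log(level, msg, fields); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	fmt.Print(render(Info, Debug, "dropped", nil)) // below the minimum, prints nothing
	fmt.Print(render(Info, Info, "Pool created", map[string]interface{}{"pool": "tank"}))
}
```

With the minimum set to INFO, the DEBUG entry is discarded before any formatting work is done, which is the main reason to check the level first.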

## Request Logging

### Access Logs

All HTTP requests are logged with:

- **Timestamp**: Request time
- **Method**: HTTP method (GET, POST, etc.)
- **Path**: Request path
- **Status**: HTTP status code
- **Duration**: Request processing time
- **Request ID**: Unique request identifier
- **Remote Address**: Client IP address

**Example Log Entry:**
```
2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms
```
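
One common way to produce such entries is an HTTP middleware that wraps the response writer to capture the status code. The sketch below shows that pattern under stated assumptions: `accessLog`, `statusRecorder`, `formatLine`, and `demoLine` are hypothetical names, the timestamp and level prefix are assumed to be added by the logger, and the request ID is assumed to have been placed in the `X-Request-Id` request header by an earlier middleware.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// formatLine lays out the fields in the same order as the example entry.
func formatLine(remote, method, path string, status int, rid string, durMs int64) string {
	return fmt.Sprintf("%s %s %s status=%d rid=%s dur=%dms", remote, method, path, status, rid, durMs)
}

// accessLog wraps next and passes one formatted line per request to logf.
// Note that r.RemoteAddr includes the client port, unlike the example entry.
func accessLog(next http.Handler, logf func(string)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Default to 200 because handlers may write a body without calling WriteHeader.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		logf(formatLine(r.RemoteAddr, r.Method, r.URL.Path, rec.status,
			r.Header.Get("X-Request-Id"), time.Since(start).Milliseconds()))
	})
}

// demoLine sends one request through the middleware and returns the logged line.
func demoLine() string {
	var line string
	h := accessLog(http.NotFoundHandler(), func(s string) { line = s })
	req := httptest.NewRequest(http.MethodGet, "/api/v1/pools", nil)
	req.Header.Set("X-Request-Id", "abc123")
	h.ServeHTTP(httptest.NewRecorder(), req)
	return line
}

func main() {
	fmt.Println(demoLine())
}
```

Logging after `next.ServeHTTP` returns is what makes the status and duration available; the trade-off is that a handler that panics would skip the log line unless the middleware also recovers.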

### Request ID

Every request gets a unique request ID:

- **Header**: `X-Request-Id`
- **Usage**: Track requests across services
- **Format**: 32-character hex string
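
A 32-character hex string corresponds to 16 random bytes. The sketch below generates such an ID and attaches it to the response header and the request context. It is an illustration only: `newRequestID`, `withRequestID`, and the `ctxKey` type are hypothetical, and whether AtlasOS uses `crypto/rand` or reuses an ID supplied by the client is not stated in this document.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// ctxKey is an unexported type so the context key cannot collide with other packages.
type ctxKey struct{}

var requestIDKey = ctxKey{}

// newRequestID returns 16 random bytes encoded as a 32-character hex string.
func newRequestID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// withRequestID sets X-Request-Id on the response and stores the ID in the
// request context so that handlers can include it in their log entries.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id, err := newRequestID()
		if err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError)
			return
		}
		w.Header().Set("X-Request-Id", id)
		ctx := context.WithValue(r.Context(), requestIDKey, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	id, err := newRequestID()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(id), id)
}
```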

## Diagnostic Endpoints

### System Information

**GET** `/api/v1/system/info`

Returns comprehensive system information:

```json
{
  "version": "v0.1.0-dev",
  "uptime": "3600 seconds",
  "go_version": "go1.21.0",
  "num_goroutines": 15,
  "memory": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "services": {
    "smb": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "nfs": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "iscsi": {
      "status": "stopped",
      "last_check": "2024-12-20T10:30:56Z"
    }
  },
  "database": {
    "connected": true,
    "path": "/var/lib/atlas/atlas.db"
  }
}
```
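
The runtime-related fields in this response (`go_version`, `num_goroutines`, and `memory`) have direct counterparts in Go's standard `runtime` package. The sketch below shows how they could be collected; the `memoryInfo` struct and `readMemory` function are hypothetical names, and the mapping of JSON fields to `runtime.MemStats` fields is inferred from the field names rather than confirmed by the source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

// memoryInfo mirrors the "memory" object in the example response.
type memoryInfo struct {
	Alloc      uint64 `json:"alloc"`       // bytes of allocated heap objects
	TotalAlloc uint64 `json:"total_alloc"` // cumulative bytes allocated
	Sys        uint64 `json:"sys"`         // bytes obtained from the OS
	NumGC      uint32 `json:"num_gc"`      // completed GC cycles
}

// readMemory takes a snapshot of the Go runtime's memory statistics.
func readMemory() memoryInfo {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memoryInfo{Alloc: m.Alloc, TotalAlloc: m.TotalAlloc, Sys: m.Sys, NumGC: m.NumGC}
}

func main() {
	info := map[string]interface{}{
		"go_version":     runtime.Version(),
		"num_goroutines": runtime.NumGoroutine(),
		"memory":         readMemory(),
	}
	out, err := json.MarshalIndent(info, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

`runtime.ReadMemStats` briefly stops the world, so an endpoint built this way is better suited to occasional diagnostics than to high-frequency polling.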

### Health Check

**GET** `/health`

Detailed health check with component status:

```json
{
  "status": "healthy",
  "timestamp": "2024-12-20T10:30:56Z",
  "checks": {
    "zfs": "healthy",
    "database": "healthy",
    "smb": "healthy",
    "nfs": "healthy",
    "iscsi": "stopped"
  }
}
```

**Status Values:**
- `healthy`: Component is working correctly
- `degraded`: Some components have issues but system is operational
- `unhealthy`: Critical components are failing

**HTTP Status Codes:**
- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy
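
The mapping from component checks to an overall status and HTTP code can be sketched as follows. Several points here are assumptions, not documented behavior: which components count as critical (`zfs` and `database` in this example), that a `stopped` service is treated as intentional and not as a failure (consistent with the example response above, where `iscsi` is stopped and the overall status is `healthy`), and the function names `overall` and `httpStatus`.

```go
package main

import (
	"fmt"
	"net/http"
)

// overall derives the system status from per-component checks.
// A failing critical component makes the system unhealthy; any other
// failing component makes it degraded.
func overall(checks map[string]string, critical map[string]bool) string {
	status := "healthy"
	for name, s := range checks {
		if s == "healthy" || s == "stopped" {
			continue // assumed: a stopped service is not a failure
		}
		if critical[name] {
			return "unhealthy"
		}
		status = "degraded"
	}
	return status
}

// httpStatus maps the overall status to the documented HTTP status codes.
func httpStatus(status string) int {
	if status == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	checks := map[string]string{
		"zfs": "healthy", "database": "healthy",
		"smb": "healthy", "nfs": "healthy", "iscsi": "stopped",
	}
	critical := map[string]bool{"zfs": true, "database": true}
	status := overall(checks, critical)
	fmt.Println(status, httpStatus(status))
}
```

Returning `200 OK` for `degraded` keeps load balancers from removing a node that can still serve requests, while alerting can key off the `status` field in the body.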

### System Logs

**GET** `/api/v1/system/logs?limit=100`

Returns recent system logs (from audit logs):

```json
{
  "logs": [
    {
      "timestamp": "2024-12-20T10:30:56Z",
      "level": "INFO",
      "actor": "user-1",
      "action": "pool.create",
      "resource": "pool:tank",
      "result": "success",
      "ip": "192.168.1.100"
    }
  ],
  "count": 1
}
```

**Query Parameters:**
- `limit`: Maximum number of logs to return (default: 100, max: 1000)
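
The default and maximum for `limit` could be enforced with a small parsing helper like the one below. The function name `parseLimit` is hypothetical, and two behaviors are assumptions of this sketch: values above the maximum are clamped to 1000, and missing or invalid values fall back to 100 (the real endpoint might reject them with an error instead).

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseLimit converts the raw "limit" query value into a bounded integer.
func parseLimit(raw string) int {
	const (
		defaultLimit = 100
		maxLimit     = 1000
	)
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultLimit // missing, non-numeric, or non-positive
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}

func main() {
	u, err := url.Parse("/api/v1/system/logs?limit=50")
	if err != nil {
		panic(err)
	}
	fmt.Println(parseLimit(u.Query().Get("limit")))
}
```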

### Garbage Collection

**POST** `/api/v1/system/gc`

Triggers garbage collection and returns memory statistics:

```json
{
  "before": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "after": {
    "alloc": 512000,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 6
  },
  "freed": 536576
}
```
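
In the example above, `freed` equals `before.alloc - after.alloc` (1048576 - 512000 = 536576). A handler producing this shape could be built on `runtime.GC` and `runtime.ReadMemStats`, as in the sketch below. The type and function names are hypothetical, and reporting `0` when allocation grew during the collection is an assumption made to avoid unsigned underflow.

```go
package main

import (
	"fmt"
	"runtime"
)

// memStats mirrors the "before" and "after" objects in the example response.
type memStats struct {
	Alloc      uint64 `json:"alloc"`
	TotalAlloc uint64 `json:"total_alloc"`
	Sys        uint64 `json:"sys"`
	NumGC      uint32 `json:"num_gc"`
}

type gcReport struct {
	Before memStats `json:"before"`
	After  memStats `json:"after"`
	Freed  uint64   `json:"freed"`
}

func snapshot() memStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memStats{Alloc: m.Alloc, TotalAlloc: m.TotalAlloc, Sys: m.Sys, NumGC: m.NumGC}
}

// freedBytes returns before - after, or 0 if other goroutines allocated
// more than the collection released (which would underflow a uint64).
func freedBytes(before, after uint64) uint64 {
	if before > after {
		return before - after
	}
	return 0
}

// collect forces a garbage collection and reports statistics around it.
func collect() gcReport {
	before := snapshot()
	runtime.GC() // blocks until the collection is complete
	after := snapshot()
	return gcReport{Before: before, After: after, Freed: freedBytes(before.Alloc, after.Alloc)}
}

func main() {
	r := collect()
	fmt.Printf("num_gc %d -> %d, freed %d bytes\n", r.Before.NumGC, r.After.NumGC, r.Freed)
}
```

Forcing a collection is mainly useful for diagnostics, for example to tell a genuine leak apart from garbage that has not been collected yet.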

## Audit Logging

Audit logs track all mutating operations:

- **Actor**: User ID or "system"
- **Action**: Operation type (e.g., "pool.create")
- **Resource**: Resource identifier
- **Result**: "success" or "failure"
- **IP**: Client IP address
- **User Agent**: Client user agent
- **Timestamp**: Operation time

See [Audit Logging Documentation](./AUDIT_LOGGING.md) for details.

## Log Rotation

### Current Implementation

- **In-Memory**: Audit logs stored in memory
- **Rotation**: Automatic rotation when max logs reached
- **Limit**: Configurable (default: 10,000 logs)
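
A bounded in-memory store of this kind can be sketched as a mutex-protected slice that discards its oldest entries once the limit is exceeded. This is an illustration, not the AtlasOS implementation: the `store` type and its methods are hypothetical, entries are simplified to strings, and interpreting "rotation" as dropping the oldest entries is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// store keeps at most limit entries, oldest first.
type store struct {
	mu      sync.Mutex
	limit   int
	entries []string
}

// newStore creates a store; a non-positive limit selects the documented default.
func newStore(limit int) *store {
	if limit <= 0 {
		limit = 10000
	}
	return &store{limit: limit}
}

// Add appends an entry and drops the oldest ones once the limit is exceeded.
func (s *store) Add(e string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries = append(s.entries, e)
	if over := len(s.entries) - s.limit; over > 0 {
		// Copy the tail so the dropped entries can be garbage collected.
		s.entries = append([]string(nil), s.entries[over:]...)
	}
}

// Snapshot returns a copy of the current entries, oldest first.
func (s *store) Snapshot() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.entries...)
}

func main() {
	s := newStore(3)
	for _, e := range []string{"a", "b", "c", "d"} {
		s.Add(e)
	}
	fmt.Println(s.Snapshot()) // "a" has been rotated out
}
```

Copying on every overflow keeps the sketch short; a ring buffer would avoid the copy if the store sits on a hot path.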

### Future Enhancements

- **File Logging**: Write logs to files
- **Automatic Rotation**: Rotate log files by size/age
- **Compression**: Compress old log files
- **Retention**: Configurable retention policies

## Best Practices

### 1. Use Appropriate Log Levels

```go
// Debug - detailed information
logger.Debug("Processing request", map[string]interface{}{
	"request_id": reqID,
	"user":       userID,
})

// Info - important events
logger.Info("User logged in", map[string]interface{}{
	"user": userID,
})

// Warn - potential issues
logger.Warn("High memory usage", map[string]interface{}{
	"usage": "85%",
})

// Error - failures
logger.Error("Failed to create pool", err, map[string]interface{}{
	"pool": poolName,
})
```

### 2. Include Context

Always include relevant context in logs:

```go
// Good
logger.Info("Pool created", map[string]interface{}{
	"pool": poolName,
	"size": poolSize,
	"user": userID,
})

// Avoid
logger.Info("Pool created")
```

### 3. Use Request IDs

Include request IDs in logs for tracing:

```go
// The comma-ok form avoids a panic when no request ID is stored in the context.
reqID, _ := r.Context().Value(requestIDKey).(string)
logger.Info("Processing request", map[string]interface{}{
	"request_id": reqID,
})
```

### 4. Monitor Health Endpoints

Regularly check health endpoints:

```bash
# Simple health check
curl http://localhost:8080/healthz

# Detailed health check
curl http://localhost:8080/health

# System information
curl http://localhost:8080/api/v1/system/info
```

## Monitoring

### Key Metrics

Monitor these metrics for system health:

- **Request Duration**: Track in access logs
- **Error Rate**: Count of error responses
- **Memory Usage**: Check via `/api/v1/system/info`
- **Goroutine Count**: Monitor for leaks
- **Service Status**: Check service health

### Alerting

Set up alerts for:

- **Unhealthy Status**: System health check fails
- **High Error Rate**: Too many error responses
- **Memory Leaks**: Continuously increasing memory
- **Service Failures**: Services not running

## Troubleshooting

### Check System Health

```bash
curl http://localhost:8080/health
```

### View System Information

```bash
curl http://localhost:8080/api/v1/system/info
```

### Check Recent Logs

```bash
# Quote the URL so the shell does not treat "?" as a glob character
curl "http://localhost:8080/api/v1/system/logs?limit=50"
```

### Trigger GC

```bash
curl -X POST http://localhost:8080/api/v1/system/gc
```

### View Request Logs

Check application logs for request details:

```bash
# If logging to stdout
./atlas-api | grep "GET /api/v1/pools"

# If logging to file
tail -f /var/log/atlas-api.log | grep "status=500"
```

## Future Enhancements

1. **File Logging**: Write logs to files with rotation
2. **Log Aggregation**: Support for centralized logging (ELK, Loki)
3. **Structured Logging**: Full JSON logging support
4. **Log Levels per Component**: Different levels for different components
5. **Performance Logging**: Detailed performance metrics
6. **Distributed Tracing**: Request tracing across services
7. **Log Filtering**: Filter logs by level, component, etc.
8. **Real-time Log Streaming**: Stream logs via WebSocket