# Logging & Diagnostics

## Overview

AtlasOS provides comprehensive logging and diagnostic capabilities to help monitor system health, troubleshoot issues, and understand system behavior.

## Structured Logging

### Logger Package

The `internal/logger` package provides structured logging with:

- **Log Levels**: DEBUG, INFO, WARN, ERROR
- **JSON Mode**: Optional JSON-formatted output
- **Structured Fields**: Key-value pairs for context
- **Thread-Safe**: Safe for concurrent use

### Configuration

Configure logging via environment variables:

```bash
# Log level (DEBUG, INFO, WARN, ERROR)
export ATLAS_LOG_LEVEL=INFO

# Log format (json or text)
export ATLAS_LOG_FORMAT=json
```

### Usage

```go
import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"

// Simple logging
logger.Info("User logged in")
logger.Error("Failed to create pool", err)

// With fields
logger.Info("Pool created", map[string]interface{}{
	"pool": "tank",
	"size": "10TB",
})
```

### Log Levels

- **DEBUG**: Detailed information for debugging
- **INFO**: General informational messages
- **WARN**: Warning messages for potential issues
- **ERROR**: Error messages for failures
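
Log levels are conventionally ordered, with the configured level acting as a threshold: `ATLAS_LOG_LEVEL=WARN` would suppress DEBUG and INFO messages. Assuming `internal/logger` follows that convention, filtering reduces to an integer comparison. `Level`, `parseLevel` and `enabled` are hypothetical names used only for this sketch.

```go
package main

import "strings"

// Level orders the four documented severities so they can be compared.
type Level int

const (
	LevelDebug Level = iota
	LevelInfo
	LevelWarn
	LevelError
)

// parseLevel maps a level name to a Level. Unknown names fall back to
// INFO, an assumed default.
func parseLevel(name string) Level {
	switch strings.ToUpper(name) {
	case "DEBUG":
		return LevelDebug
	case "WARN":
		return LevelWarn
	case "ERROR":
		return LevelError
	default:
		return LevelInfo
	}
}

// enabled reports whether a message at level msg should be emitted when
// the configured threshold is th: messages at or above th pass.
func enabled(th, msg Level) bool {
	return msg >= th
}
```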

## Request Logging

### Access Logs

All HTTP requests are logged with:

- **Timestamp**: Request time
- **Method**: HTTP method (GET, POST, etc.)
- **Path**: Request path
- **Status**: HTTP status code
- **Duration**: Request processing time
- **Request ID**: Unique request identifier
- **Remote Address**: Client IP address

**Example Log Entry:**

```
2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms
```

### Request ID

Every request gets a unique request ID:

- **Header**: `X-Request-Id`
- **Usage**: Track requests across services
- **Format**: 32-character hex string

## Diagnostic Endpoints

### System Information

**GET** `/api/v1/system/info`

Returns comprehensive system information:

```json
{
  "version": "v0.1.0-dev",
  "uptime": "3600 seconds",
  "go_version": "go1.21.0",
  "num_goroutines": 15,
  "memory": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "services": {
    "smb": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "nfs": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "iscsi": {
      "status": "stopped",
      "last_check": "2024-12-20T10:30:56Z"
    }
  },
  "database": {
    "connected": true,
    "path": "/var/lib/atlas/atlas.db"
  }
}
```

### Health Check

**GET** `/health`

Detailed health check with per-component status:

```json
{
  "status": "healthy",
  "timestamp": "2024-12-20T10:30:56Z",
  "checks": {
    "zfs": "healthy",
    "database": "healthy",
    "smb": "healthy",
    "nfs": "healthy",
    "iscsi": "stopped"
  }
}
```

**Status Values:**

- `healthy`: All components are working correctly
- `degraded`: Some components have issues, but the system is operational
- `unhealthy`: Critical components are failing

**HTTP Status Codes:**

- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy
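
The rules above can be read as a fold over the component checks. The sketch below is one possible interpretation, not the actual AtlasOS logic: which components count as critical is supplied by the caller (the document does not list them), and `stopped` is treated as acceptable, matching the example where a stopped iSCSI service still yields `healthy`. `overallStatus` and `httpStatusFor` are hypothetical names.

```go
package main

import "net/http"

// overallStatus folds per-component states into healthy, degraded or
// unhealthy. "healthy" and "stopped" are both treated as acceptable.
func overallStatus(checks map[string]string, critical map[string]bool) string {
	status := "healthy"
	for name, state := range checks {
		if state == "healthy" || state == "stopped" {
			continue
		}
		if critical[name] {
			return "unhealthy" // any failing critical component is decisive
		}
		status = "degraded"
	}
	return status
}

// httpStatusFor maps the overall status to the documented response code.
func httpStatusFor(status string) int {
	if status == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}
```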

### System Logs

**GET** `/api/v1/system/logs?limit=100`

Returns the most recent entries from the audit log:

```json
{
  "logs": [
    {
      "timestamp": "2024-12-20T10:30:56Z",
      "level": "INFO",
      "actor": "user-1",
      "action": "pool.create",
      "resource": "pool:tank",
      "result": "success",
      "ip": "192.168.1.100"
    }
  ],
  "count": 1
}
```

**Query Parameters:**

- `limit`: Maximum number of logs to return (default: 100, max: 1000)
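
Applying the documented default and cap to the raw query value might look like this. It is a sketch: treating a missing, non-numeric or non-positive value as the default is an assumption (the real endpoint could reject it with `400 Bad Request` instead), and `parseLimit` is a hypothetical name.

```go
package main

import "strconv"

const (
	defaultLogLimit = 100  // documented default
	maxLogLimit     = 1000 // documented maximum
)

// parseLimit turns the raw "limit" query value into a usable count.
func parseLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultLogLimit
	}
	if n > maxLogLimit {
		return maxLogLimit // clamp rather than reject oversized requests
	}
	return n
}
```

In a handler this would be called as `parseLimit(r.URL.Query().Get("limit"))`; an absent parameter yields the empty string and therefore the default.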

### Garbage Collection

**POST** `/api/v1/system/gc`

Triggers a garbage collection cycle and returns memory statistics from before and after the run:

```json
{
  "before": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "after": {
    "alloc": 512000,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 6
  },
  "freed": 536576
}
```
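
The example figures are consistent with `freed` being `before.alloc` minus `after.alloc` (1048576 - 512000 = 536576). Assuming the endpoint reports fields from Go's `runtime.MemStats`, a handler could gather them as sketched below. `MemSnapshot`, `snapshot`, `freed` and `collect` are hypothetical names.

```go
package main

import "runtime"

// MemSnapshot mirrors the memory fields in the response above.
type MemSnapshot struct {
	Alloc      uint64 `json:"alloc"`
	TotalAlloc uint64 `json:"total_alloc"`
	Sys        uint64 `json:"sys"`
	NumGC      uint32 `json:"num_gc"`
}

// snapshot copies the relevant fields out of runtime.MemStats.
func snapshot() MemSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return MemSnapshot{Alloc: m.Alloc, TotalAlloc: m.TotalAlloc, Sys: m.Sys, NumGC: m.NumGC}
}

// freed returns before.Alloc - after.Alloc, or 0 if the heap grew in the
// meantime (unsigned subtraction would otherwise wrap around).
func freed(before, after MemSnapshot) uint64 {
	if after.Alloc >= before.Alloc {
		return 0
	}
	return before.Alloc - after.Alloc
}

// collect forces a full collection and reports memory before and after.
func collect() (before, after MemSnapshot) {
	before = snapshot()
	runtime.GC() // blocks until the collection has completed
	after = snapshot()
	return before, after
}
```

The Go runtime schedules collections on its own, so this endpoint is mainly a diagnostic for checking how much of the current heap is reclaimable.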

## Audit Logging

Audit logs track all mutating operations:

- **Actor**: User ID or "system"
- **Action**: Operation type (e.g., "pool.create")
- **Resource**: Resource identifier
- **Result**: "success" or "failure"
- **IP**: Client IP address
- **User Agent**: Client user agent
- **Timestamp**: Operation time

See [Audit Logging Documentation](./AUDIT_LOGGING.md) for details.

## Log Rotation

### Current Implementation

- **In-Memory**: Audit logs stored in memory
- **Rotation**: Automatic rotation when max logs reached
- **Limit**: Configurable (default: 10,000 logs)
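
One common way to implement bounded in-memory storage is to evict the oldest entry once the limit is reached. The sketch below assumes that is what "rotation" means here, and that reads return the newest entries first; neither is confirmed by this document. `auditStore` and its methods are hypothetical, and entries are plain strings to keep it short.

```go
package main

import "sync"

// auditStore is a bounded, concurrency-safe in-memory log store.
type auditStore struct {
	mu       sync.Mutex
	entries  []string
	capacity int
}

// newAuditStore falls back to the documented default of 10,000 entries.
func newAuditStore(capacity int) *auditStore {
	if capacity <= 0 {
		capacity = 10000
	}
	return &auditStore{capacity: capacity}
}

// Add appends e, evicting the oldest entry first when the store is full.
func (s *auditStore) Add(e string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.entries) == s.capacity {
		// Shift left by one; a circular buffer would avoid this O(n) copy.
		copy(s.entries, s.entries[1:])
		s.entries = s.entries[:len(s.entries)-1]
	}
	s.entries = append(s.entries, e)
}

// Recent returns up to limit entries, newest first.
func (s *auditStore) Recent(limit int) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if limit < 0 {
		limit = 0
	}
	if limit > len(s.entries) {
		limit = len(s.entries)
	}
	out := make([]string, 0, limit)
	for i := len(s.entries) - 1; i >= len(s.entries)-limit; i-- {
		out = append(out, s.entries[i])
	}
	return out
}
```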

### Future Enhancements

- **File Logging**: Write logs to files
- **Automatic Rotation**: Rotate log files by size/age
- **Compression**: Compress old log files
- **Retention**: Configurable retention policies

## Best Practices

### 1. Use Appropriate Log Levels

```go
// Debug - detailed information
logger.Debug("Processing request", map[string]interface{}{
	"request_id": reqID,
	"user":       userID,
})

// Info - important events
logger.Info("User logged in", map[string]interface{}{
	"user": userID,
})

// Warn - potential issues
logger.Warn("High memory usage", map[string]interface{}{
	"usage": "85%",
})

// Error - failures
logger.Error("Failed to create pool", err, map[string]interface{}{
	"pool": poolName,
})
```

### 2. Include Context

Always include relevant context in logs:

```go
// Good
logger.Info("Pool created", map[string]interface{}{
	"pool": poolName,
	"size": poolSize,
	"user": userID,
})

// Avoid
logger.Info("Pool created")
```

### 3. Use Request IDs

Include request IDs in logs for tracing:

```go
// Use the comma-ok form: a plain type assertion panics if the value is
// missing or not a string.
reqID, _ := r.Context().Value(requestIDKey).(string)
logger.Info("Processing request", map[string]interface{}{
	"request_id": reqID,
})
```

### 4. Monitor Health Endpoints

Regularly check health endpoints:

```bash
# Simple health check
curl http://localhost:8080/healthz

# Detailed health check
curl http://localhost:8080/health

# System information
curl http://localhost:8080/api/v1/system/info
```

## Monitoring

### Key Metrics

Monitor these metrics for system health:

- **Request Duration**: Track via the `dur` field in access logs
- **Error Rate**: Count of error responses
- **Memory Usage**: Check via `/api/v1/system/info`
- **Goroutine Count**: Watch `num_goroutines` for leaks
- **Service Status**: Check service health

### Alerting

Set up alerts for:

- **Unhealthy Status**: System health check fails
- **High Error Rate**: Too many error responses
- **Memory Leaks**: Continuously increasing memory
- **Service Failures**: Services not running

## Troubleshooting

### Check System Health

```bash
curl http://localhost:8080/health
```

### View System Information

```bash
curl http://localhost:8080/api/v1/system/info
```

### Check Recent Logs

```bash
# Quote the URL so the shell does not treat "?" as a glob character
curl "http://localhost:8080/api/v1/system/logs?limit=50"
```

### Trigger GC

```bash
curl -X POST http://localhost:8080/api/v1/system/gc
```

### View Request Logs

Check application logs for request details:

```bash
# If logging to stdout
./atlas-api | grep "GET /api/v1/pools"

# If logging to file
tail -f /var/log/atlas-api.log | grep "status=500"
```

## Future Enhancements

1. **File Logging**: Write logs to files with rotation
2. **Log Aggregation**: Support for centralized logging (ELK, Loki)
3. **Structured Logging**: Full JSON logging support
4. **Log Levels per Component**: Different levels for different components
5. **Performance Logging**: Detailed performance metrics
6. **Distributed Tracing**: Request tracing across services
7. **Log Filtering**: Filter logs by level, component, etc.
8. **Real-time Log Streaming**: Stream logs via WebSocket