
# Logging & Diagnostics

## Overview

AtlasOS provides structured application logs, per-request access logs, audit logs, and HTTP diagnostic endpoints for monitoring system health, troubleshooting issues, and understanding system behavior.

## Structured Logging

### Logger Package

The `internal/logger` package provides structured logging with:

- **Log Levels**: DEBUG, INFO, WARN, ERROR
- **JSON Mode**: Optional JSON-formatted output
- **Structured Fields**: Key-value pairs for context
- **Thread-Safe**: Safe for concurrent use

### Configuration

Configure logging via environment variables:

```bash
# Log level (DEBUG, INFO, WARN, ERROR)
export ATLAS_LOG_LEVEL=INFO

# Log format (json or text)
export ATLAS_LOG_FORMAT=json
```
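As a rough illustration of how these two variables could be interpreted, the sketch below maps them to a level and a format flag. The `Level` type, the function names, and the fallback to INFO for unknown values are assumptions made for the example; the actual `internal/logger` package may behave differently.

```go
package main

import (
	"os"
	"strings"
)

// Level is a hypothetical severity type used only for this sketch.
type Level int

const (
	LevelDebug Level = iota
	LevelInfo
	LevelWarn
	LevelError
)

// parseLevel maps an ATLAS_LOG_LEVEL value to a Level.
// Assumption: unknown or empty values fall back to INFO.
func parseLevel(s string) Level {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "DEBUG":
		return LevelDebug
	case "WARN":
		return LevelWarn
	case "ERROR":
		return LevelError
	default:
		return LevelInfo
	}
}

// useJSON reports whether an ATLAS_LOG_FORMAT value selects JSON output.
// Anything other than "json" (case-insensitive) is treated as text.
func useJSON(s string) bool {
	return strings.EqualFold(strings.TrimSpace(s), "json")
}

// loadLogConfig reads both settings from the environment.
func loadLogConfig() (Level, bool) {
	return parseLevel(os.Getenv("ATLAS_LOG_LEVEL")), useJSON(os.Getenv("ATLAS_LOG_FORMAT"))
}
```

Accepting values case-insensitively is a convenience choice here, so that `info` and `INFO` behave the same.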

### Usage

```go
import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"

// Simple logging
logger.Info("User logged in")
logger.Error("Failed to create pool", err)

// With fields
logger.Info("Pool created", map[string]interface{}{
    "pool": "tank",
    "size": "10TB",
})
```

### Log Levels

- **DEBUG**: Detailed information for debugging
- **INFO**: General informational messages
- **WARN**: Warning messages for potential issues
- **ERROR**: Error messages for failures

## Request Logging

### Access Logs

All HTTP requests are logged with:

- **Timestamp**: Request time
- **Method**: HTTP method (GET, POST, etc.)
- **Path**: Request path
- **Status**: HTTP status code
- **Duration**: Request processing time
- **Request ID**: Unique request identifier
- **Remote Address**: Client IP address

**Example Log Entry:**
```
2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms
```
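For monitoring scripts it can be handy to turn such a line back into fields. The following is a minimal parsing sketch that assumes the exact layout of the example entry above (timestamp, level, remote address, method, path, then `key=value` pairs); `accessEntry` and `parseAccessLine` are hypothetical names, not part of AtlasOS.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// accessEntry holds the fields of one access log line (hypothetical type).
type accessEntry struct {
	Time      time.Time
	Level     string
	Remote    string
	Method    string
	Path      string
	Status    int
	RequestID string
	Duration  time.Duration
}

// parseAccessLine parses a text-format access log line laid out like the
// example entry. Unknown key=value pairs are ignored.
func parseAccessLine(line string) (accessEntry, error) {
	f := strings.Fields(line)
	if len(f) < 5 {
		return accessEntry{}, fmt.Errorf("too few fields: %q", line)
	}
	ts, err := time.Parse(time.RFC3339, f[0])
	if err != nil {
		return accessEntry{}, err
	}
	e := accessEntry{
		Time:   ts,
		Level:  strings.Trim(f[1], "[]"),
		Remote: f[2],
		Method: f[3],
		Path:   f[4],
	}
	for _, kv := range f[5:] {
		k, v, ok := strings.Cut(kv, "=")
		if !ok {
			continue
		}
		switch k {
		case "status":
			if e.Status, err = strconv.Atoi(v); err != nil {
				return accessEntry{}, err
			}
		case "rid":
			e.RequestID = v
		case "dur":
			if e.Duration, err = time.ParseDuration(v); err != nil {
				return accessEntry{}, err
			}
		}
	}
	return e, nil
}
```

When `ATLAS_LOG_FORMAT=json` is set, decoding the JSON output is likely simpler than parsing the text form.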

### Request ID

Every request gets a unique request ID:

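- **Header**: `X-Request-Id`
- **Usage**: Correlate all log entries that belong to one request, including across services
- **Format**: 32-character hex string (the `rid` value in the example log entry above appears shortened)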
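One way to produce an ID of this shape is to hex-encode 16 random bytes. The middleware below is a sketch under that assumption; `newRequestID`, `withRequestID`, and the context key are illustrative names and may not match the AtlasOS implementation.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"net/http"
)

// ctxKey is an unexported type so the context key cannot collide with
// keys defined in other packages.
type ctxKey struct{}

// requestIDKey is a hypothetical context key for the request ID.
var requestIDKey = ctxKey{}

// newRequestID returns 16 random bytes encoded as 32 hex characters.
func newRequestID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// withRequestID assigns a fresh ID to every request, exposes it in the
// X-Request-Id response header, and stores it in the request context.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id, err := newRequestID()
		if err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError)
			return
		}
		w.Header().Set("X-Request-Id", id)
		ctx := context.WithValue(r.Context(), requestIDKey, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```

Handlers further down the chain can then read the ID from the context and attach it to their own log entries.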

## Diagnostic Endpoints

### System Information

**GET** `/api/v1/system/info`

Returns version, uptime, Go runtime and memory statistics, service status, and database state:

```json
{
  "version": "v0.1.0-dev",
  "uptime": "3600 seconds",
  "go_version": "go1.21.0",
  "num_goroutines": 15,
  "memory": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "services": {
    "smb": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "nfs": {
      "status": "running",
      "last_check": "2024-12-20T10:30:56Z"
    },
    "iscsi": {
      "status": "stopped",
      "last_check": "2024-12-20T10:30:56Z"
    }
  },
  "database": {
    "connected": true,
    "path": "/var/lib/atlas/atlas.db"
  }
}
```

### Health Check

**GET** `/health`

Returns the overall system status together with a per-component status:

```json
{
  "status": "healthy",
  "timestamp": "2024-12-20T10:30:56Z",
  "checks": {
    "zfs": "healthy",
    "database": "healthy",
    "smb": "healthy",
    "nfs": "healthy",
    "iscsi": "stopped"
  }
}
```

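**Status Values** (top-level `status` field):
- `healthy`: All components are working correctly
- `degraded`: Some components have issues, but the system is operational
- `unhealthy`: Critical components are failing

Entries under `checks` can also report a service state such as `stopped`, as in the example above.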

**HTTP Status Codes:**
- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy
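The mapping from component checks to an overall status and HTTP code could look roughly like the sketch below. Which components count as critical, and the rule that a `stopped` non-critical service does not degrade the system, are assumptions inferred from the example response; the function names are hypothetical.

```go
package main

import "net/http"

// overallStatus derives the top-level status from per-component checks.
// Assumptions: a failing critical component makes the system unhealthy,
// any other failing component makes it degraded, and a non-critical
// service that is merely "stopped" is not treated as a failure.
func overallStatus(checks map[string]string, critical map[string]bool) string {
	status := "healthy"
	for name, s := range checks {
		if s == "healthy" {
			continue
		}
		if critical[name] {
			return "unhealthy"
		}
		if s == "stopped" {
			continue
		}
		status = "degraded"
	}
	return status
}

// httpStatusFor maps the overall status to the documented HTTP codes:
// 200 for healthy or degraded, 503 for unhealthy.
func httpStatusFor(status string) int {
	if status == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}
```

Returning `200` for a degraded system keeps simple probes passing while the JSON body still exposes the failing component.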

### System Logs

**GET** `/api/v1/system/logs?limit=100`

Returns the most recent entries from the audit log:

```json
{
  "logs": [
    {
      "timestamp": "2024-12-20T10:30:56Z",
      "level": "INFO",
      "actor": "user-1",
      "action": "pool.create",
      "resource": "pool:tank",
      "result": "success",
      "ip": "192.168.1.100"
    }
  ],
  "count": 1
}
```

**Query Parameters:**
- `limit`: Maximum number of logs to return (default: 100, max: 1000)
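A handler might normalize `limit` along the following lines. This is a sketch: the documented default and maximum are taken from above, but falling back to the default on invalid input and clamping oversized values (rather than rejecting them) are assumptions.

```go
package main

import "strconv"

const (
	defaultLimit = 100  // documented default
	maxLimit     = 1000 // documented maximum
)

// parseLimit converts the raw query value into a usable limit.
// Assumptions: missing, non-numeric, or non-positive values yield the
// default, and values above the maximum are clamped to it.
func parseLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultLimit
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}
```

In a handler this would typically be called as `parseLimit(r.URL.Query().Get("limit"))`.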

### Garbage Collection

**POST** `/api/v1/system/gc`

Triggers garbage collection and returns memory statistics:

```json
{
  "before": {
    "alloc": 1048576,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 5
  },
  "after": {
    "alloc": 512000,
    "total_alloc": 52428800,
    "sys": 2097152,
    "num_gc": 6
  },
  "freed": 536576
}
```
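The response shape above can be produced with the standard `runtime` package by sampling memory statistics before and after a forced collection. The sketch below shows the idea; `memSnapshot`, `snapshot`, and `collect` are illustrative names, and the real handler may be structured differently.

```go
package main

import "runtime"

// memSnapshot mirrors the memory fields shown in the response above.
type memSnapshot struct {
	Alloc      uint64 `json:"alloc"`
	TotalAlloc uint64 `json:"total_alloc"`
	Sys        uint64 `json:"sys"`
	NumGC      uint32 `json:"num_gc"`
}

// snapshot reads the current runtime memory statistics.
func snapshot() memSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memSnapshot{
		Alloc:      m.Alloc,
		TotalAlloc: m.TotalAlloc,
		Sys:        m.Sys,
		NumGC:      m.NumGC,
	}
}

// collect forces a garbage collection and reports the statistics before
// and after it. freed is the drop in live heap bytes, or 0 if the heap
// grew in the meantime (other goroutines may allocate concurrently).
func collect() (before, after memSnapshot, freed uint64) {
	before = snapshot()
	runtime.GC() // blocks until the collection has finished
	after = snapshot()
	if before.Alloc > after.Alloc {
		freed = before.Alloc - after.Alloc
	}
	return before, after, freed
}
```

In the example response, `freed` equals `before.alloc - after.alloc` (1048576 - 512000 = 536576), which is the calculation used here.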

## Audit Logging

Audit logs track all mutating operations:

- **Actor**: User ID or "system"
- **Action**: Operation type (e.g., "pool.create")
- **Resource**: Resource identifier
- **Result**: "success" or "failure"
- **IP**: Client IP address
- **User Agent**: Client user agent
- **Timestamp**: Operation time

See [Audit Logging Documentation](./AUDIT_LOGGING.md) for details.

## Log Rotation

### Current Implementation

- **In-Memory**: Audit logs stored in memory
- **Rotation**: Automatic rotation when the maximum number of entries is reached
- **Limit**: Configurable (default: 10,000 logs)
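A bounded in-memory store of this kind can be sketched as follows. The source does not say how rotation works internally, so dropping the oldest entries first is an assumption, as are the `auditStore` type and its methods; entries are plain strings here to keep the example short.

```go
package main

import "sync"

// auditStore keeps at most max entries in memory (hypothetical type).
type auditStore struct {
	mu      sync.Mutex
	max     int
	entries []string
}

// newAuditStore creates a store; non-positive limits fall back to the
// documented default of 10,000 entries.
func newAuditStore(max int) *auditStore {
	if max <= 0 {
		max = 10000
	}
	return &auditStore{max: max}
}

// add appends an entry and, once the limit is exceeded, discards the
// oldest entries (assumed rotation policy).
func (s *auditStore) add(e string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries = append(s.entries, e)
	if over := len(s.entries) - s.max; over > 0 {
		// Copy so the old backing array can be garbage collected.
		s.entries = append([]string(nil), s.entries[over:]...)
	}
}

// recent returns a copy of up to n of the newest entries, oldest first.
func (s *auditStore) recent(n int) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n > len(s.entries) {
		n = len(s.entries)
	}
	if n < 0 {
		n = 0
	}
	return append([]string(nil), s.entries[len(s.entries)-n:]...)
}
```

Because the store lives only in memory, its contents are lost when the process restarts.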

### Future Enhancements

- **File Logging**: Write logs to files
- **Automatic Rotation**: Rotate log files by size/age
- **Compression**: Compress old log files
- **Retention**: Configurable retention policies

## Best Practices

### 1. Use Appropriate Log Levels

```go
// Debug - detailed information
logger.Debug("Processing request", map[string]interface{}{
    "request_id": reqID,
    "user": userID,
})

// Info - important events
logger.Info("User logged in", map[string]interface{}{
    "user": userID,
})

// Warn - potential issues
logger.Warn("High memory usage", map[string]interface{}{
    "usage": "85%",
})

// Error - failures
logger.Error("Failed to create pool", err, map[string]interface{}{
    "pool": poolName,
})
```

### 2. Include Context

Always include relevant context in logs:

```go
// Good
logger.Info("Pool created", map[string]interface{}{
    "pool": poolName,
    "size": poolSize,
    "user": userID,
})

// Avoid
logger.Info("Pool created")
```

### 3. Use Request IDs

Include request IDs in logs for tracing:

```go
// Use the comma-ok form so a missing request ID does not cause a panic.
reqID, _ := r.Context().Value(requestIDKey).(string)
logger.Info("Processing request", map[string]interface{}{
    "request_id": reqID,
})
```

### 4. Monitor Health Endpoints

Regularly check health endpoints:

```bash
# Simple health check
curl http://localhost:8080/healthz

# Detailed health check
curl http://localhost:8080/health

# System information
curl http://localhost:8080/api/v1/system/info
```

## Monitoring

### Key Metrics

Monitor these metrics for system health:

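- **Request Duration**: `dur` field in the access logs
- **Error Rate**: Count of error responses (`status` field in the access logs)
- **Memory Usage**: `memory` block of `/api/v1/system/info`
- **Goroutine Count**: `num_goroutines` in `/api/v1/system/info`; steady growth suggests a leak
- **Service Status**: `services` block of `/api/v1/system/info` and `checks` in `/health`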

### Alerting

Set up alerts for:

- **Unhealthy Status**: `/health` returns `503` or reports `unhealthy`
- **High Error Rate**: Too many error responses in the access logs
- **Memory Leaks**: Memory usage that keeps increasing across samples
- **Service Failures**: Services that should be running report another status
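For the memory-leak alert, one simple heuristic is to sample `memory.alloc` (or `num_goroutines`) from `/api/v1/system/info` at regular intervals and flag a run of strictly increasing values. The function below is a sketch of that heuristic only; the name `sustainedGrowth` and the choice of window size are assumptions, and a production alert would usually need smoothing to avoid false positives.

```go
package main

// sustainedGrowth reports whether the last window samples are strictly
// increasing. It returns false when fewer than window samples exist or
// when window is too small to show a trend.
func sustainedGrowth(samples []uint64, window int) bool {
	if window < 2 || len(samples) < window {
		return false
	}
	recent := samples[len(samples)-window:]
	for i := 1; i < len(recent); i++ {
		if recent[i] <= recent[i-1] {
			return false
		}
	}
	return true
}
```

A single dip within the window resets the signal, which keeps normal garbage collection cycles from triggering the alert.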

## Troubleshooting

### Check System Health

```bash
curl http://localhost:8080/health
```

### View System Information

```bash
curl http://localhost:8080/api/v1/system/info
```

### Check Recent Logs

```bash
# Quote the URL so the shell does not interpret the "?"
curl "http://localhost:8080/api/v1/system/logs?limit=50"
```

### Trigger GC

```bash
curl -X POST http://localhost:8080/api/v1/system/gc
```

### View Request Logs

Check application logs for request details:

```bash
# If logging to stdout
./atlas-api | grep "GET /api/v1/pools"

# If logging to file
tail -f /var/log/atlas-api.log | grep "status=500"
```

## Future Enhancements

1. **File Logging**: Write logs to files with rotation
2. **Log Aggregation**: Support for centralized logging (ELK, Loki)
3. **Structured Logging**: Full JSON logging support
4. **Log Levels per Component**: Different levels for different components
5. **Performance Logging**: Detailed performance metrics
6. **Distributed Tracing**: Request tracing across services
7. **Log Filtering**: Filter logs by level, component, etc.
8. **Real-time Log Streaming**: Stream logs via WebSocket