logging and diagnostic features added

2025-12-15 00:45:14 +07:00
parent 3e64de18ed
commit df475bc85e
26 changed files with 5878 additions and 91 deletions
--- a/docs/API_SECURITY.md
+++ b/docs/API_SECURITY.md
@@ -0,0 +1,278 @@
+# API Security & Rate Limiting
+
+## Overview
+
+AtlasOS implements comprehensive API security measures including rate limiting, security headers, CORS protection, and request validation to protect the API from abuse and attacks.
+
+## Rate Limiting
+
+### Token Bucket Algorithm
+
+The rate limiter uses a token bucket algorithm:
+- **Default Rate**: 100 requests per minute per client
+- **Window**: 60 seconds
+- **Token Refill**: Tokens are refilled based on elapsed time
+- **Per-Client**: Rate limiting is applied per IP address or user ID
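+
+The refill behaviour listed above can be sketched as follows. This is a minimal illustration, assuming tokens refill continuously in proportion to elapsed time; the names `bucket`, `newBucket`, and `allow` are hypothetical and are not taken from `rate_limit.go`.
+
+```go
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// bucket is a hypothetical per-client token bucket (not the AtlasOS type).
+type bucket struct {
+    tokens   float64       // tokens currently available
+    capacity float64       // maximum tokens, i.e. the rate limit
+    window   time.Duration // time needed to refill an empty bucket
+    last     time.Time     // time of the previous refill
+}
+
+func newBucket(limit int, window time.Duration, now time.Time) *bucket {
+    return &bucket{tokens: float64(limit), capacity: float64(limit), window: window, last: now}
+}
+
+// allow refills the bucket for the time elapsed since the last call,
+// then consumes one token if at least one is available.
+func (b *bucket) allow(now time.Time) bool {
+    elapsed := now.Sub(b.last)
+    b.last = now
+    b.tokens += b.capacity * float64(elapsed) / float64(b.window)
+    if b.tokens > b.capacity {
+        b.tokens = b.capacity
+    }
+    if b.tokens < 1 {
+        return false
+    }
+    b.tokens--
+    return true
+}
+
+func main() {
+    start := time.Now()
+    b := newBucket(100, time.Minute, start)
+    allowed := 0
+    for i := 0; i < 150; i++ {
+        if b.allow(start) {
+            allowed++
+        }
+    }
+    fmt.Println("allowed in burst:", allowed)
+    fmt.Println("allowed after 30s:", b.allow(start.Add(30*time.Second)))
+}
+```
+
+The sketch is not safe for concurrent use; a real limiter would guard each bucket with a mutex and keep one bucket per client key.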
+
+### Rate Limit Headers
+
+All responses include rate limit headers:
+
+```
+X-RateLimit-Limit: 100
+X-RateLimit-Window: 60
+```
+
+### Rate Limit Exceeded
+
+When the rate limit is exceeded, the API returns:
+
+```json
+{
+  "code": "SERVICE_UNAVAILABLE",
+  "message": "rate limit exceeded",
+  "details": "too many requests, please try again later"
+}
+```
+
+**HTTP Status**: `429 Too Many Requests`
+
+### Client Identification
+
+Rate limiting uses different keys based on authentication:
+
+- **Authenticated Users**: `user:{user_id}` - More granular per-user limiting
+- **Unauthenticated**: `ip:{ip_address}` - IP-based limiting
+
+### Public Endpoints
+
+Public endpoints (login, health checks) are excluded from rate limiting to ensure availability.
+
+## Security Headers
+
+All responses include security headers:
+
+### X-Content-Type-Options
+- **Value**: `nosniff`
+- **Purpose**: Prevents MIME type sniffing
+
+### X-Frame-Options
+- **Value**: `DENY`
+- **Purpose**: Prevents clickjacking attacks
+
+### X-XSS-Protection
+- **Value**: `1; mode=block`
+- **Purpose**: Enables the legacy XSS filter in older browsers (current browsers ignore this header)
+
+### Referrer-Policy
+- **Value**: `strict-origin-when-cross-origin`
+- **Purpose**: Controls referrer information
+
+### Permissions-Policy
+- **Value**: `geolocation=(), microphone=(), camera=()`
+- **Purpose**: Disables unnecessary browser features
+
+### Strict-Transport-Security (HSTS)
+- **Value**: `max-age=31536000; includeSubDomains`
+- **Purpose**: Forces HTTPS connections
+- **Note**: Only added when the request arrives over TLS
+
+### Content-Security-Policy (CSP)
+- **Value**: `default-src 'self'; script-src 'self' 'unsafe-inline' https://cdn.jsdelivr.net; style-src 'self' 'unsafe-inline' https://cdn.jsdelivr.net; img-src 'self' data:; font-src 'self' https://cdn.jsdelivr.net; connect-src 'self';`
+- **Purpose**: Restricts resource loading to prevent XSS
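+
+A minimal sketch of how such headers might be applied in a Go middleware. The function name `securityHeaders` is an assumption, and the CSP and `X-XSS-Protection` headers are left out for brevity; the header values are the ones documented above.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/http"
+    "net/http/httptest"
+)
+
+// securityHeaders is a hypothetical middleware that sets the documented headers.
+func securityHeaders(next http.Handler) http.Handler {
+    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+        h := w.Header()
+        h.Set("X-Content-Type-Options", "nosniff")
+        h.Set("X-Frame-Options", "DENY")
+        h.Set("Referrer-Policy", "strict-origin-when-cross-origin")
+        h.Set("Permissions-Policy", "geolocation=(), microphone=(), camera=()")
+        // r.TLS is non-nil only when the request arrived over TLS.
+        if r.TLS != nil {
+            h.Set("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
+        }
+        next.ServeHTTP(w, r)
+    })
+}
+
+func main() {
+    ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+        w.WriteHeader(http.StatusOK)
+    })
+    rec := httptest.NewRecorder()
+    securityHeaders(ok).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
+    fmt.Println("X-Frame-Options:", rec.Header().Get("X-Frame-Options"))
+    fmt.Println("HSTS set on plain HTTP:", rec.Header().Get("Strict-Transport-Security") != "")
+}
+```
+
+Headers are set before calling the next handler so that they are present even when the handler writes the response immediately.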
+
+## CORS (Cross-Origin Resource Sharing)
+
+### Allowed Origins
+
+By default, the following origins are allowed:
+
+- `http://localhost:8080`
+- `http://localhost:3000`
+- `http://127.0.0.1:8080`
+- Same-origin requests (no Origin header)
+
+### CORS Headers
+
+When a request comes from an allowed origin:
+
+```
+Access-Control-Allow-Origin: http://localhost:8080
+Access-Control-Allow-Methods: GET, POST, PUT, DELETE, PATCH, OPTIONS
+Access-Control-Allow-Headers: Content-Type, Authorization, X-Requested-With
+Access-Control-Allow-Credentials: true
+Access-Control-Max-Age: 3600
+```
+
+### Preflight Requests
+
+OPTIONS requests are handled automatically:
+
+- **Status**: `204 No Content`
+- **Headers**: All CORS headers included
+- **Purpose**: Browser preflight checks
+
+## Request Size Limits
+
+### Maximum Request Body Size
+
+- **Limit**: 10 MB (10,485,760 bytes)
+- **Enforcement**: Automatic via `http.MaxBytesReader`
+- **Error**: Returns `413 Request Entity Too Large` if exceeded
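+
+A sketch of how `http.MaxBytesReader` can enforce such a limit, assuming Go 1.19 or later for `http.MaxBytesError`. The names `limitBody` and `readAll` are illustrative and do not come from the AtlasOS source.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "io"
+    "net/http"
+    "net/http/httptest"
+    "strings"
+)
+
+const maxBody = 10 << 20 // 10 MB, the documented limit
+
+// limitBody is a hypothetical middleware that caps the request body size.
+func limitBody(limit int64, next http.Handler) http.Handler {
+    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+        r.Body = http.MaxBytesReader(w, r.Body, limit)
+        next.ServeHTTP(w, r)
+    })
+}
+
+// readAll is an illustrative handler that reports 413 when the cap is hit.
+func readAll(w http.ResponseWriter, r *http.Request) {
+    _, err := io.ReadAll(r.Body)
+    var tooLarge *http.MaxBytesError
+    if errors.As(err, &tooLarge) {
+        http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
+        return
+    }
+    w.WriteHeader(http.StatusOK)
+}
+
+func main() {
+    // A tiny limit keeps the demonstration small.
+    tiny := limitBody(8, http.HandlerFunc(readAll))
+    rec := httptest.NewRecorder()
+    tiny.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/", strings.NewReader("0123456789")))
+    fmt.Println("8-byte limit, 10-byte body:", rec.Code)
+
+    full := limitBody(maxBody, http.HandlerFunc(readAll))
+    rec = httptest.NewRecorder()
+    full.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/", strings.NewReader("0123456789")))
+    fmt.Println("10 MB limit, 10-byte body:", rec.Code)
+}
+```
+
+The limit only takes effect when the handler reads the body, so the error surfaces at the read site rather than in the middleware itself.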
+
+### Content-Type Validation
+
+POST, PUT, and PATCH requests must include a valid `Content-Type` header:
+
+**Allowed Types:**
+- `application/json`
+- `application/x-www-form-urlencoded`
+- `multipart/form-data`
+
+**Error Response:**
+```json
+{
+  "code": "BAD_REQUEST",
+  "message": "Content-Type must be application/json"
+}
+```
+
+## Middleware Chain Order
+
+Security middleware is applied in the following order (outer to inner):
+
+1. **CORS** - Handles preflight requests
+2. **Security Headers** - Adds security headers
+3. **Request Size Limit** - Enforces 10MB limit
+4. **Content-Type Validation** - Validates request content type
+5. **Rate Limiting** - Enforces rate limits
+6. **Error Recovery** - Catches panics
+7. **Request ID** - Generates request IDs
+8. **Logging** - Logs requests
+9. **Audit** - Records audit logs
+10. **Authentication** - Validates JWT tokens
+11. **Routes** - Handles requests
+
+## Public Endpoints
+
+The following endpoints are excluded from certain security checks:
+
+- `/api/v1/auth/login` - Rate limiting, Content-Type validation
+- `/api/v1/auth/logout` - Rate limiting, Content-Type validation
+- `/healthz` - Rate limiting, Content-Type validation
+- `/metrics` - Rate limiting, Content-Type validation
+- `/api/docs` - Rate limiting, Content-Type validation
+- `/api/openapi.yaml` - Rate limiting, Content-Type validation
+
+## Best Practices
+
+### For API Consumers
+
+1. **Respect Rate Limits**: Implement exponential backoff when rate limited
+2. **Use Authentication**: Authenticated requests are limited per user rather than per IP address
+3. **Include Content-Type**: Always include `Content-Type: application/json`
+4. **Handle Errors**: Check for `429` status and retry after delay
+5. **Request Size**: Keep request bodies under 10MB
+
+### For Administrators
+
+1. **Monitor Rate Limits**: Check logs for rate limit violations
+2. **Adjust Limits**: Modify rate limit values in code if needed
+3. **CORS Configuration**: Update allowed origins for production
+4. **HTTPS**: Always use HTTPS in production for HSTS
+5. **Security Headers**: Review CSP policy for your use case
+
+## Configuration
+
+### Rate Limiting
+
+Rate limits are currently hardcoded; to change them, edit the values in code:
+
+```go
+// In rate_limit.go
+rateLimiter := NewRateLimiter(100, time.Minute) // 100 req/min
+```
+
+### CORS Origins
+
+Update allowed origins in `security_middleware.go`:
+
+```go
+allowedOrigins := []string{
+    "https://yourdomain.com",
+    "https://app.yourdomain.com",
+}
+```
+
+### Request Size Limit
+
+Modify in `app.go`:
+
+```go
+a.requestSizeMiddleware(10*1024*1024) // 10MB
+```
+
+## Error Responses
+
+### Rate Limit Exceeded
+
+```json
+{
+  "code": "SERVICE_UNAVAILABLE",
+  "message": "rate limit exceeded",
+  "details": "too many requests, please try again later"
+}
+```
+
+**Status**: `429 Too Many Requests`
+
+### Request Too Large
+
+```json
+{
+  "code": "BAD_REQUEST",
+  "message": "request body too large"
+}
+```
+
+**Status**: `413 Request Entity Too Large`
+
+### Invalid Content-Type
+
+```json
+{
+  "code": "BAD_REQUEST",
+  "message": "Content-Type must be application/json"
+}
+```
+
+**Status**: `400 Bad Request`
+
+## Monitoring
+
+### Rate Limit Metrics
+
+Monitor rate limit violations:
+
+- Check audit logs for rate limit events
+- Monitor `429` status codes in access logs
+- Track rate limit headers in responses
+
+### Security Events
+
+Monitor for security-related events:
+
+- Invalid Content-Type headers
+- Request size violations
+- CORS violations (check server logs)
+- Authentication failures
+
+## Future Enhancements
+
+1. **Configurable Rate Limits**: Environment variable configuration
+2. **Per-Endpoint Limits**: Different limits for different endpoints
+3. **IP Whitelisting**: Bypass rate limits for trusted IPs
+4. **Rate Limit Metrics**: Prometheus metrics for rate limiting
+5. **Distributed Rate Limiting**: Redis-based for multi-instance deployments
+6. **Advanced CORS**: Configurable CORS via environment variables
+7. **Request Timeout**: Configurable request timeout limits
--- a/docs/BACKUP_RESTORE.md
+++ b/docs/BACKUP_RESTORE.md
@@ -0,0 +1,307 @@
+# Configuration Backup & Restore
+
+## Overview
+
+AtlasOS provides comprehensive configuration backup and restore functionality, allowing you to save and restore all system configurations including users, storage services (SMB/NFS/iSCSI), and snapshot policies.
+
+## Features
+
+- **Full Configuration Backup**: Backs up all system configurations
+- **Compressed Archives**: Backups are stored as gzipped tar archives
+- **Metadata Tracking**: Each backup includes metadata (ID, timestamp, description, size)
+- **Verification**: Verify backup integrity before restore
+- **Dry Run**: Test restore operations without making changes
+- **Selective Restore**: Restore specific components or full system
+
+## Configuration
+
+Set the backup directory using the `ATLAS_BACKUP_DIR` environment variable:
+
+```bash
+export ATLAS_BACKUP_DIR=/var/lib/atlas/backups
+./atlas-api
+```
+
+If not set, defaults to `data/backups` in the current directory.
+
+## Backup Contents
+
+A backup includes:
+
+- **Users**: All user accounts (passwords cannot be restored - users must reset)
+- **SMB Shares**: All SMB/CIFS share configurations
+- **NFS Exports**: All NFS export configurations
+- **iSCSI Targets**: All iSCSI targets and LUN mappings
+- **Snapshot Policies**: All automated snapshot policies
+- **System Config**: Database path and other system settings
+
+## API Endpoints
+
+### Create Backup
+
+**POST** `/api/v1/backups`
+
+Creates a new backup of all system configurations.
+
+**Request Body:**
+```json
+{
+  "description": "Backup before major changes"
+}
+```
+
+**Response:**
+```json
+{
+  "id": "backup-1703123456",
+  "created_at": "2024-12-20T10:30:56Z",
+  "version": "1.0",
+  "description": "Backup before major changes",
+  "size": 24576
+}
+```
+
+**Example:**
+```bash
+curl -X POST http://localhost:8080/api/v1/backups \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d '{"description": "Weekly backup"}'
+```
+
+### List Backups
+
+**GET** `/api/v1/backups`
+
+Lists all available backups.
+
+**Response:**
+```json
+[
+  {
+    "id": "backup-1703123456",
+    "created_at": "2024-12-20T10:30:56Z",
+    "version": "1.0",
+    "description": "Weekly backup",
+    "size": 24576
+  },
+  {
+    "id": "backup-1703037056",
+    "created_at": "2024-12-19T10:30:56Z",
+    "version": "1.0",
+    "description": "",
+    "size": 18432
+  }
+]
+```
+
+**Example:**
+```bash
+curl -X GET http://localhost:8080/api/v1/backups \
+  -H "Authorization: Bearer <token>"
+```
+
+### Get Backup Details
+
+**GET** `/api/v1/backups/{id}`
+
+Retrieves metadata for a specific backup.
+
+**Response:**
+```json
+{
+  "id": "backup-1703123456",
+  "created_at": "2024-12-20T10:30:56Z",
+  "version": "1.0",
+  "description": "Weekly backup",
+  "size": 24576
+}
+```
+
+**Example:**
+```bash
+curl -X GET http://localhost:8080/api/v1/backups/backup-1703123456 \
+  -H "Authorization: Bearer <token>"
+```
+
+### Verify Backup
+
+**GET** `/api/v1/backups/{id}?verify=true`
+
+Verifies that a backup file is valid and can be restored.
+
+**Response:**
+```json
+{
+  "message": "backup is valid",
+  "backup_id": "backup-1703123456",
+  "metadata": {
+    "id": "backup-1703123456",
+    "created_at": "2024-12-20T10:30:56Z",
+    "version": "1.0",
+    "description": "Weekly backup",
+    "size": 24576
+  }
+}
+```
+
+**Example:**
+```bash
+curl -X GET "http://localhost:8080/api/v1/backups/backup-1703123456?verify=true" \
+  -H "Authorization: Bearer <token>"
+```
+
+### Restore Backup
+
+**POST** `/api/v1/backups/{id}/restore`
+
+Restores configuration from a backup.
+
+**Request Body:**
+```json
+{
+  "dry_run": false
+}
+```
+
+**Parameters:**
+- `dry_run` (optional): If `true`, shows what would be restored without making changes
+
+**Response:**
+```json
+{
+  "message": "backup restored successfully",
+  "backup_id": "backup-1703123456"
+}
+```
+
+**Example:**
+```bash
+# Dry run (test restore)
+curl -X POST http://localhost:8080/api/v1/backups/backup-1703123456/restore \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d '{"dry_run": true}'
+
+# Actual restore
+curl -X POST http://localhost:8080/api/v1/backups/backup-1703123456/restore \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d '{"dry_run": false}'
+```
+
+### Delete Backup
+
+**DELETE** `/api/v1/backups/{id}`
+
+Deletes a backup file and its metadata.
+
+**Response:**
+```json
+{
+  "message": "backup deleted",
+  "backup_id": "backup-1703123456"
+}
+```
+
+**Example:**
+```bash
+curl -X DELETE http://localhost:8080/api/v1/backups/backup-1703123456 \
+  -H "Authorization: Bearer <token>"
+```
+
+## Restore Process
+
+When restoring a backup:
+
+1. **Verification**: Backup is verified before restore
+2. **User Restoration**: 
+   - Users are restored with temporary passwords
+   - Default admin user (user-1) is skipped
+   - Users must reset their passwords after restore
+3. **Storage Services**: 
+   - SMB shares, NFS exports, and iSCSI targets are restored
+   - Existing configurations are skipped (not overwritten)
+   - Service configurations are automatically applied
+4. **Snapshot Policies**: 
+   - Policies are restored by dataset
+   - Existing policies are skipped
+5. **Service Application**: 
+   - Samba, NFS, and iSCSI services are reconfigured
+   - Errors are logged but don't fail the restore
+
+## Backup File Format
+
+Backups are stored as gzipped tar archives containing:
+
+- `metadata.json`: Backup metadata (ID, timestamp, description, etc.)
+- `config.json`: All configuration data (users, shares, exports, targets, policies)
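+
+A gzipped tar archive with this layout can be inspected with the Go standard library alone. The sketch below builds a small archive in memory and lists its entries; `listEntries` and `buildArchive` are hypothetical helpers, and the file contents are placeholders rather than real backup data.
+
+```go
+package main
+
+import (
+    "archive/tar"
+    "bytes"
+    "compress/gzip"
+    "fmt"
+    "io"
+)
+
+// listEntries returns the file names found in a gzipped tar stream.
+func listEntries(r io.Reader) ([]string, error) {
+    gz, err := gzip.NewReader(r)
+    if err != nil {
+        return nil, err
+    }
+    defer gz.Close()
+    var names []string
+    tr := tar.NewReader(gz)
+    for {
+        hdr, err := tr.Next()
+        if err == io.EOF {
+            return names, nil
+        }
+        if err != nil {
+            return nil, err
+        }
+        names = append(names, hdr.Name)
+    }
+}
+
+// buildArchive writes the given files into an in-memory gzipped tar archive.
+func buildArchive(files map[string][]byte) ([]byte, error) {
+    var buf bytes.Buffer
+    gz := gzip.NewWriter(&buf)
+    tw := tar.NewWriter(gz)
+    for name, data := range files {
+        hdr := &tar.Header{Name: name, Mode: 0600, Size: int64(len(data))}
+        if err := tw.WriteHeader(hdr); err != nil {
+            return nil, err
+        }
+        if _, err := tw.Write(data); err != nil {
+            return nil, err
+        }
+    }
+    // Close the tar writer first, then the gzip writer, to flush both layers.
+    if err := tw.Close(); err != nil {
+        return nil, err
+    }
+    if err := gz.Close(); err != nil {
+        return nil, err
+    }
+    return buf.Bytes(), nil
+}
+
+func main() {
+    archive, err := buildArchive(map[string][]byte{
+        "metadata.json": []byte(`{"id":"backup-1703123456"}`),
+        "config.json":   []byte(`{}`),
+    })
+    if err != nil {
+        panic(err)
+    }
+    names, err := listEntries(bytes.NewReader(archive))
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(names)
+}
+```
+
+Entry order follows the order in which files were written, which is not fixed here because Go map iteration order is random.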
+
+## Best Practices
+
+1. **Regular Backups**: Create backups before major configuration changes
+2. **Verify Before Restore**: Always verify backups before restoring
+3. **Test Restores**: Use dry run to test restore operations
+4. **Backup Retention**: Keep multiple backups for different time periods
+5. **Offsite Storage**: Copy backups to external storage for disaster recovery
+6. **Password Management**: Users must reset passwords after restore
+
+## Limitations
+
+- **Passwords**: User passwords cannot be restored (security feature)
+- **ZFS Data**: Backups only include configuration, not ZFS pool/dataset data
+- **Audit Logs**: Audit logs are not included in backups
+- **Jobs**: Background jobs are not included in backups
+
+## Error Handling
+
+- **Invalid Backup**: Verification fails if backup is corrupted
+- **Existing Resources**: Restore skips resources that already exist
+- **Service Errors**: Service configuration errors are logged but don't fail restore
+- **Partial Restore**: Restore continues even if some components fail
+
+## Security Considerations
+
+1. **Backup Storage**: Store backups in secure locations
+2. **Access Control**: Backup endpoints require authentication
+3. **Password Security**: Passwords are never included in backups
+4. **Encryption**: Consider encrypting backups for sensitive environments
+
+## Example Workflow
+
+```bash
+# 1. Create backup before changes
+BACKUP_ID=$(curl -s -X POST http://localhost:8080/api/v1/backups \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d '{"description": "Before major changes"}' \
+  | jq -r '.id')
+
+# 2. Verify backup
+curl -X GET "http://localhost:8080/api/v1/backups/$BACKUP_ID?verify=true" \
+  -H "Authorization: Bearer <token>"
+
+# 3. Make configuration changes
+# ... make changes ...
+
+# 4. Test restore (dry run)
+curl -X POST "http://localhost:8080/api/v1/backups/$BACKUP_ID/restore" \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d '{"dry_run": true}'
+
+# 5. Restore if needed
+curl -X POST "http://localhost:8080/api/v1/backups/$BACKUP_ID/restore" \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d '{"dry_run": false}'
+```
+
+## Future Enhancements
+
+- **Scheduled Backups**: Automatic backup scheduling
+- **Incremental Backups**: Only backup changes since last backup
+- **Backup Encryption**: Encrypt backup files
+- **Remote Storage**: Support for S3, FTP, etc.
+- **Backup Compression**: Additional compression options
+- **Selective Restore**: Restore specific components only
--- a/docs/ERROR_HANDLING.md
+++ b/docs/ERROR_HANDLING.md
@@ -0,0 +1,242 @@
+# Error Handling & Recovery
+
+## Overview
+
+AtlasOS implements comprehensive error handling with structured error responses, graceful degradation, and automatic recovery mechanisms to ensure system reliability and good user experience.
+
+## Error Types
+
+### Structured API Errors
+
+All API errors follow a consistent structure:
+
+```json
+{
+  "code": "NOT_FOUND",
+  "message": "dataset not found",
+  "details": "tank/missing"
+}
+```
+
+### Error Codes
+
+- `INTERNAL_ERROR` - Unexpected server errors (500)
+- `NOT_FOUND` - Resource not found (404)
+- `BAD_REQUEST` - Invalid request parameters (400)
+- `CONFLICT` - Resource conflict (409)
+- `UNAUTHORIZED` - Authentication required (401)
+- `FORBIDDEN` - Insufficient permissions (403)
+- `SERVICE_UNAVAILABLE` - Service temporarily unavailable (503)
+- `VALIDATION_ERROR` - Input validation failed (400)
+
+## Error Handling Patterns
+
+### 1. Structured Error Responses
+
+All errors use the `errors.APIError` type for consistent formatting:
+
+```go
+if resource == nil {
+    writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
+    return
+}
+```
+
+### 2. Graceful Degradation
+
+Service operations (SMB/NFS/iSCSI) use graceful degradation:
+
+- **Desired State Stored**: Configuration is always stored in the store
+- **Service Application**: Service configuration is applied asynchronously
+- **Non-Blocking**: Service failures don't fail API requests
+- **Retry Ready**: Failed operations can be retried later
+
+Example:
+```go
+// Store the configuration (always succeeds)
+share, err := a.smbStore.Create(...)
+
+// Apply to service (may fail, but doesn't block)
+if err := a.smbService.ApplyConfiguration(shares); err != nil {
+    // Log but don't fail - desired state is stored
+    log.Printf("SMB service configuration failed (non-fatal): %v", err)
+}
+```
+
+### 3. Panic Recovery
+
+All HTTP handlers are wrapped with panic recovery middleware:
+
+```go
+func (a *App) errorMiddleware(next http.Handler) http.Handler {
+    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+        defer recoverPanic(w, r)
+        next.ServeHTTP(w, r)
+    })
+}
+```
+
+Panics are caught and converted to proper error responses instead of crashing the server.
+
+### 4. Atomic Operations with Rollback
+
+Service configuration operations are atomic with automatic rollback:
+
+1. **Write to temporary file** (`*.atlas.tmp`)
+2. **Backup existing config** (`.backup`)
+3. **Atomically replace** config file
+4. **Reload service**
+5. **On failure**: Automatically restore backup
+
+Example (SMB):
+```go
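+// Write the new config to a temp file
+if err := os.WriteFile(tmpPath, config, 0644); err != nil {
+    return err
+}
+
+// Back up the existing config (copyFile is a hypothetical helper)
+if err := copyFile(configPath, backupPath); err != nil {
+    return err
+}
+
+// Atomically replace the live config
+if err := os.Rename(tmpPath, configPath); err != nil {
+    return err
+}
+
+// Reload the service
+if err := reloadService(); err != nil {
+    // Restore the backup automatically
+    os.Rename(backupPath, configPath)
+    return err
+}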
+```
+
+## Retry Mechanisms
+
+### Retry Configuration
+
+The `errors.Retry` function provides configurable retry logic:
+
+```go
+config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
+err := errors.Retry(func() error {
+    return serviceOperation()
+}, config)
+```
+
+### Default Retry Behavior
+
+- **Max Attempts**: 3
+- **Backoff**: Exponential (100ms, 200ms, 400ms)
+- **Use Case**: Transient failures (network, temporary service unavailability)
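+
+A minimal sketch of retry with exponential backoff using these defaults. The names `retryConfig` and `retry` are assumptions and do not reflect the actual `errors.Retry` signature. The sketch sleeps only between attempts; whether the real implementation also waits after the final attempt is not stated here.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "time"
+)
+
+// retryConfig mirrors the documented defaults; the field names are assumptions.
+type retryConfig struct {
+    maxAttempts int
+    baseDelay   time.Duration
+}
+
+// retry calls op until it succeeds or maxAttempts is reached,
+// doubling the delay after each failed attempt.
+func retry(op func() error, cfg retryConfig) error {
+    var err error
+    delay := cfg.baseDelay
+    for attempt := 1; attempt <= cfg.maxAttempts; attempt++ {
+        if err = op(); err == nil {
+            return nil
+        }
+        if attempt < cfg.maxAttempts {
+            time.Sleep(delay)
+            delay *= 2
+        }
+    }
+    return fmt.Errorf("after %d attempts: %w", cfg.maxAttempts, err)
+}
+
+func main() {
+    calls := 0
+    err := retry(func() error {
+        calls++
+        if calls < 3 {
+            return errors.New("service temporarily unavailable")
+        }
+        return nil
+    }, retryConfig{maxAttempts: 3, baseDelay: 100 * time.Millisecond})
+    fmt.Println("calls:", calls, "error:", err)
+}
+```
+
+Wrapping the last error with `%w` keeps it available to `errors.Is` and `errors.As` at the call site.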
+
+## Error Recovery
+
+### Service Configuration Recovery
+
+When service configuration fails:
+
+1. **Configuration is stored** (desired state preserved)
+2. **Error is logged** (for debugging)
+3. **Operation continues** (API request succeeds)
+4. **Manual retry available** (via API or automatic retry later)
+
+### Database Recovery
+
+- **Connection failures**: Logged and retried
+- **Transaction failures**: Rolled back automatically
+- **Schema errors**: Detected during migration
+
+### ZFS Operation Recovery
+
+- **Command failures**: Returned as errors to caller
+- **Partial failures**: State is preserved, operation can be retried
+- **Validation**: Performed before destructive operations
+
+## Error Logging
+
+All errors are logged with context:
+
+```go
+log.Printf("create SMB share error: %v", err)
+log.Printf("%s service error: %v", serviceName, err)
+```
+
+Error logs include:
+- Error message
+- Operation context
+- Resource identifiers
+- Timestamp (via standard log)
+
+## Best Practices
+
+### 1. Always Use Structured Errors
+
+```go
+// Good
+writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))
+
+// Avoid
+writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})
+```
+
+### 2. Handle Service Errors Gracefully
+
+```go
+// Good - graceful degradation
+if err := service.Apply(); err != nil {
+    log.Printf("service error (non-fatal): %v", err)
+    // Continue - desired state is stored
+}
+
+// Avoid - failing the request
+if err := service.Apply(); err != nil {
+    return err // Fails the whole request even though the desired state is stored
+}
+```
+
+### 3. Validate Before Operations
+
+```go
+// Good - validate first
+if !datasetExists {
+    writeError(w, errors.ErrNotFound("dataset"))
+    return
+}
+// Then perform operation
+```
+
+### 4. Use Context for Error Details
+
+```go
+// Good - include context
+writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))
+
+// Avoid - generic errors
+writeError(w, errors.ErrInternal("error"))
+```
+
+## Error Response Format
+
+All error responses follow this structure:
+
+```json
+{
+  "code": "ERROR_CODE",
+  "message": "Human-readable error message",
+  "details": "Additional context (optional)"
+}
+```
+
+HTTP status codes match error types:
+- `400` - Bad Request / Validation Error
+- `401` - Unauthorized
+- `403` - Forbidden
+- `404` - Not Found
+- `409` - Conflict
+- `500` - Internal Error
+- `503` - Service Unavailable
+
+## Future Enhancements
+
+1. **Error Tracking**: Centralized error tracking and alerting
+2. **Automatic Retry Queue**: Background retry for failed operations
+3. **Error Metrics**: Track error rates by type and endpoint
+4. **User-Friendly Messages**: More descriptive error messages
+5. **Error Correlation**: Link related errors for debugging
--- a/docs/LOGGING_DIAGNOSTICS.md
+++ b/docs/LOGGING_DIAGNOSTICS.md
@@ -0,0 +1,366 @@
+# Logging & Diagnostics
+
+## Overview
+
+AtlasOS provides comprehensive logging and diagnostic capabilities to help monitor system health, troubleshoot issues, and understand system behavior.
+
+## Structured Logging
+
+### Logger Package
+
+The `internal/logger` package provides structured logging with:
+
+- **Log Levels**: DEBUG, INFO, WARN, ERROR
+- **JSON Mode**: Optional JSON-formatted output
+- **Structured Fields**: Key-value pairs for context
+- **Thread-Safe**: Safe for concurrent use
+
+### Configuration
+
+Configure logging via environment variables:
+
+```bash
+# Log level (DEBUG, INFO, WARN, ERROR)
+export ATLAS_LOG_LEVEL=INFO
+
+# Log format (json or text)
+export ATLAS_LOG_FORMAT=json
+```
+
+### Usage
+
+```go
+import "gitea.avt.data-center.id/othman.suseno/atlas/internal/logger"
+
+// Simple logging
+logger.Info("User logged in")
+logger.Error("Failed to create pool", err)
+
+// With fields
+logger.Info("Pool created", map[string]interface{}{
+    "pool": "tank",
+    "size": "10TB",
+})
+```
+
+### Log Levels
+
+- **DEBUG**: Detailed information for debugging
+- **INFO**: General informational messages
+- **WARN**: Warning messages for potential issues
+- **ERROR**: Error messages for failures
+
+## Request Logging
+
+### Access Logs
+
+All HTTP requests are logged with:
+
+- **Timestamp**: Request time
+- **Method**: HTTP method (GET, POST, etc.)
+- **Path**: Request path
+- **Status**: HTTP status code
+- **Duration**: Request processing time
+- **Request ID**: Unique request identifier
+- **Remote Address**: Client IP address
+
+**Example Log Entry:**
+```
+2024-12-20T10:30:56Z [INFO] 192.168.1.100 GET /api/v1/pools status=200 rid=abc123 dur=45ms
+```
+
+### Request ID
+
+Every request gets a unique request ID:
+
+- **Header**: `X-Request-Id`
+- **Usage**: Track requests across services
+- **Format**: 32-character hex string
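+
+One way to produce an ID in this format is to hex-encode 16 random bytes. The sketch below assumes `crypto/rand` as the source of randomness; `newRequestID` is a hypothetical name, and the actual generator may differ.
+
+```go
+package main
+
+import (
+    "crypto/rand"
+    "encoding/hex"
+    "fmt"
+)
+
+// newRequestID returns 16 random bytes encoded as 32 hex characters.
+func newRequestID() (string, error) {
+    b := make([]byte, 16)
+    if _, err := rand.Read(b); err != nil {
+        return "", err
+    }
+    return hex.EncodeToString(b), nil
+}
+
+func main() {
+    id, err := newRequestID()
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println("X-Request-Id:", id)
+}
+```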
+
+## Diagnostic Endpoints
+
+### System Information
+
+**GET** `/api/v1/system/info`
+
+Returns comprehensive system information:
+
+```json
+{
+  "version": "v0.1.0-dev",
+  "uptime": "3600 seconds",
+  "go_version": "go1.21.0",
+  "num_goroutines": 15,
+  "memory": {
+    "alloc": 1048576,
+    "total_alloc": 52428800,
+    "sys": 2097152,
+    "num_gc": 5
+  },
+  "services": {
+    "smb": {
+      "status": "running",
+      "last_check": "2024-12-20T10:30:56Z"
+    },
+    "nfs": {
+      "status": "running",
+      "last_check": "2024-12-20T10:30:56Z"
+    },
+    "iscsi": {
+      "status": "stopped",
+      "last_check": "2024-12-20T10:30:56Z"
+    }
+  },
+  "database": {
+    "connected": true,
+    "path": "/var/lib/atlas/atlas.db"
+  }
+}
+```
+
+### Health Check
+
+**GET** `/health`
+
+Detailed health check with component status:
+
+```json
+{
+  "status": "healthy",
+  "timestamp": "2024-12-20T10:30:56Z",
+  "checks": {
+    "zfs": "healthy",
+    "database": "healthy",
+    "smb": "healthy",
+    "nfs": "healthy",
+    "iscsi": "stopped"
+  }
+}
+```
+
+**Status Values:**
+- `healthy`: Component is working correctly
+- `degraded`: Some components have issues but system is operational
+- `unhealthy`: Critical components are failing
+
+**HTTP Status Codes:**
+- `200 OK`: System is healthy or degraded
+- `503 Service Unavailable`: System is unhealthy
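+
+The mapping from overall status to HTTP status code can be sketched as below. The `healthReport` type and `writeHealth` function are illustrative names, and the `timestamp` field is omitted for brevity.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+    "net/http"
+    "net/http/httptest"
+)
+
+// healthReport mirrors part of the documented response body; the Go type is illustrative.
+type healthReport struct {
+    Status string            `json:"status"`
+    Checks map[string]string `json:"checks"`
+}
+
+// writeHealth sends the report with 503 for "unhealthy" and 200 otherwise.
+func writeHealth(w http.ResponseWriter, rep healthReport) {
+    code := http.StatusOK
+    if rep.Status == "unhealthy" {
+        code = http.StatusServiceUnavailable
+    }
+    w.Header().Set("Content-Type", "application/json")
+    w.WriteHeader(code)
+    if err := json.NewEncoder(w).Encode(rep); err != nil {
+        fmt.Println("encode health report:", err)
+    }
+}
+
+func main() {
+    for _, status := range []string{"healthy", "degraded", "unhealthy"} {
+        rec := httptest.NewRecorder()
+        writeHealth(rec, healthReport{Status: status, Checks: map[string]string{"zfs": status}})
+        fmt.Println(status, rec.Code)
+    }
+}
+```
+
+Returning `200` for `degraded` keeps load balancers from removing an instance that is still able to serve requests.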
+
+### System Logs
+
+**GET** `/api/v1/system/logs?limit=100`
+
+Returns recent system logs (from audit logs):
+
+```json
+{
+  "logs": [
+    {
+      "timestamp": "2024-12-20T10:30:56Z",
+      "level": "INFO",
+      "actor": "user-1",
+      "action": "pool.create",
+      "resource": "pool:tank",
+      "result": "success",
+      "ip": "192.168.1.100"
+    }
+  ],
+  "count": 1
+}
+```
+
+**Query Parameters:**
+- `limit`: Maximum number of logs to return (default: 100, max: 1000)
+
+### Garbage Collection
+
+**POST** `/api/v1/system/gc`
+
+Triggers garbage collection and returns memory statistics:
+
+```json
+{
+  "before": {
+    "alloc": 1048576,
+    "total_alloc": 52428800,
+    "sys": 2097152,
+    "num_gc": 5
+  },
+  "after": {
+    "alloc": 512000,
+    "total_alloc": 52428800,
+    "sys": 2097152,
+    "num_gc": 6
+  },
+  "freed": 536576
+}
+```
+
+## Audit Logging
+
+Audit logs track all mutating operations:
+
+- **Actor**: User ID or "system"
+- **Action**: Operation type (e.g., "pool.create")
+- **Resource**: Resource identifier
+- **Result**: "success" or "failure"
+- **IP**: Client IP address
+- **User Agent**: Client user agent
+- **Timestamp**: Operation time
+
+See [Audit Logging Documentation](./AUDIT_LOGGING.md) for details.
+
+## Log Rotation
+
+### Current Implementation
+
+- **In-Memory**: Audit logs stored in memory
+- **Rotation**: Automatic rotation when max logs reached
+- **Limit**: Configurable (default: 10,000 logs)
+
+### Future Enhancements
+
+- **File Logging**: Write logs to files
+- **Automatic Rotation**: Rotate log files by size/age
+- **Compression**: Compress old log files
+- **Retention**: Configurable retention policies
+
+## Best Practices
+
+### 1. Use Appropriate Log Levels
+
+```go
+// Debug - detailed information
+logger.Debug("Processing request", map[string]interface{}{
+    "request_id": reqID,
+    "user": userID,
+})
+
+// Info - important events
+logger.Info("User logged in", map[string]interface{}{
+    "user": userID,
+})
+
+// Warn - potential issues
+logger.Warn("High memory usage", map[string]interface{}{
+    "usage": "85%",
+})
+
+// Error - failures
+logger.Error("Failed to create pool", err, map[string]interface{}{
+    "pool": poolName,
+})
+```
+
+### 2. Include Context
+
+Always include relevant context in logs:
+
+```go
+// Good
+logger.Info("Pool created", map[string]interface{}{
+    "pool": poolName,
+    "size": poolSize,
+    "user": userID,
+})
+
+// Avoid
+logger.Info("Pool created")
+```
+
+### 3. Use Request IDs
+
+Include request IDs in logs for tracing:
+
+```go
+reqID, _ := r.Context().Value(requestIDKey).(string) // empty string if no ID is set
+logger.Info("Processing request", map[string]interface{}{
+    "request_id": reqID,
+})
+```
+
+### 4. Monitor Health Endpoints
+
+Regularly check health endpoints:
+
+```bash
+# Simple health check
+curl http://localhost:8080/healthz
+
+# Detailed health check
+curl http://localhost:8080/health
+
+# System information
+curl http://localhost:8080/api/v1/system/info
+```
+
+## Monitoring
+
+### Key Metrics
+
+Monitor these metrics for system health:
+
+- **Request Duration**: Track in access logs
+- **Error Rate**: Count of error responses
+- **Memory Usage**: Check via `/api/v1/system/info`
+- **Goroutine Count**: Monitor for leaks
+- **Service Status**: Check service health
+
+### Alerting
+
+Set up alerts for:
+
+- **Unhealthy Status**: System health check fails
+- **High Error Rate**: Too many error responses
+- **Memory Leaks**: Continuously increasing memory
+- **Service Failures**: Services not running
+
+## Troubleshooting
+
+### Check System Health
+
+```bash
+curl http://localhost:8080/health
+```
+
+### View System Information
+
+```bash
+curl http://localhost:8080/api/v1/system/info
+```
+
+### Check Recent Logs
+
+```bash
+curl "http://localhost:8080/api/v1/system/logs?limit=50"
+```
+
+### Trigger GC
+
+```bash
+curl -X POST http://localhost:8080/api/v1/system/gc
+```
+
+### View Request Logs
+
+Check application logs for request details:
+
+```bash
+# If logging to stdout
+./atlas-api | grep "GET /api/v1/pools"
+
+# If logging to file
+tail -f /var/log/atlas-api.log | grep "status=500"
+```
+
+## Future Enhancements
+
+1. **File Logging**: Write logs to files with rotation
+2. **Log Aggregation**: Support for centralized logging (ELK, Loki)
+3. **Structured Logging**: Full JSON logging support
+4. **Log Levels per Component**: Different levels for different components
+5. **Performance Logging**: Detailed performance metrics
+6. **Distributed Tracing**: Request tracing across services
+7. **Log Filtering**: Filter logs by level, component, etc.
+8. **Real-time Log Streaming**: Stream logs via WebSocket
--- a/docs/VALIDATION.md
+++ b/docs/VALIDATION.md
@@ -0,0 +1,232 @@
+# Input Validation & Sanitization
+
+## Overview
+
+AtlasOS implements comprehensive input validation and sanitization to ensure data integrity, security, and prevent injection attacks. All user inputs are validated before processing.
+
+## Validation Rules
+
+### ZFS Names (Pools, Datasets, ZVOLs, Snapshots)
+
+**Rules:**
+- Must start with alphanumeric character
+- Can contain: `a-z`, `A-Z`, `0-9`, `_`, `-`, `.`, `:`
+- Cannot start with `-` or `.`
+- Maximum length: 256 characters
+- Cannot be empty
+
+**Example:**
+```go
+if err := validation.ValidateZFSName("tank/data"); err != nil {
+    // Handle error
+}
+```
+
+### Usernames
+
+**Rules:**
+- Minimum length: 3 characters
+- Maximum length: 32 characters
+- Can contain: `a-z`, `A-Z`, `0-9`, `_`, `-`, `.`
+- Must start with alphanumeric character
+
+**Example:**
+```go
+if err := validation.ValidateUsername("admin"); err != nil {
+    // Handle error
+}
+```
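+
+The username rules above can be expressed as a single regular expression. This is a sketch under the stated rules, not the code behind `validation.ValidateUsername`; `validateUsername` and `usernamePattern` are hypothetical names.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "regexp"
+)
+
+// usernamePattern requires one leading alphanumeric character followed by
+// 2 to 31 allowed characters, for a total length of 3 to 32.
+var usernamePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_.-]{2,31}$`)
+
+// validateUsername returns an error when name breaks any of the rules.
+func validateUsername(name string) error {
+    if !usernamePattern.MatchString(name) {
+        return errors.New("username must be 3-32 characters, start with a letter or digit, and use only letters, digits, '_', '-', '.'")
+    }
+    return nil
+}
+
+func main() {
+    for _, name := range []string{"admin", "ab", "-admin", "backup.user_01"} {
+        fmt.Printf("%q valid: %t\n", name, validateUsername(name) == nil)
+    }
+}
+```
+
+Encoding the length in the pattern keeps the check in one place, at the cost of a less specific error message than separate checks would give.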
+
+### Passwords
+
+**Rules:**
+- Minimum length: 8 characters
+- Maximum length: 128 characters
+- Must contain at least one letter
+- Must contain at least one number
+
+**Example:**
+```go
+if err := validation.ValidatePassword("SecurePass123"); err != nil {
+    // Handle error
+}
+```
+
+### Email Addresses
+
+**Rules:**
+- Optional field (can be empty)
+- Maximum length: 254 characters
+- Must match email format pattern
+- Basic format validation (RFC 5322 simplified)
+
+**Example:**
+```go
+if err := validation.ValidateEmail("user@example.com"); err != nil {
+    // Handle error
+}
+```
+
+### SMB Share Names
+
+**Rules:**
+- Maximum length: 80 characters
+- Can contain: `a-z`, `A-Z`, `0-9`, `_`, `-`, `.`
+- Cannot be reserved Windows names (CON, PRN, AUX, NUL, COM1-9, LPT1-9)
+- Must start with alphanumeric character
+
+**Example:**
+```go
+if err := validation.ValidateShareName("data-share"); err != nil {
+    // Handle error
+}
+```
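+
+One way these share-name rules might be combined is shown below. The names `validateShareName`, `shareNameRe`, and `isReservedName` are hypothetical, and the sketch assumes that reserved names are compared case-insensitively and that an empty name is rejected.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "regexp"
+    "strings"
+)
+
+// shareNameRe requires an alphanumeric first character followed by
+// characters from the allowed set.
+var shareNameRe = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
+
+// isReservedName reports whether name is CON, PRN, AUX, NUL, COM1-9 or LPT1-9.
+// Assumption: the comparison ignores case.
+func isReservedName(name string) bool {
+    upper := strings.ToUpper(name)
+    switch upper {
+    case "CON", "PRN", "AUX", "NUL":
+        return true
+    }
+    if len(upper) == 4 && (strings.HasPrefix(upper, "COM") || strings.HasPrefix(upper, "LPT")) {
+        return upper[3] >= '1' && upper[3] <= '9'
+    }
+    return false
+}
+
+// validateShareName is a hypothetical stand-in for validation.ValidateShareName.
+func validateShareName(name string) error {
+    if len(name) > 80 {
+        return errors.New("share name must be at most 80 characters")
+    }
+    if !shareNameRe.MatchString(name) {
+        return errors.New("share name has invalid characters or an invalid first character")
+    }
+    if isReservedName(name) {
+        return fmt.Errorf("share name %q is a reserved Windows name", name)
+    }
+    return nil
+}
+
+func main() {
+    fmt.Println(validateShareName("data-share")) // valid, prints <nil>
+    fmt.Println(validateShareName("com1"))       // rejected: reserved name
+}
+```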
+
+### iSCSI Qualified Name (IQN)
+
+**Rules:**
+- Must start with `iqn.`
+- Format: `iqn.yyyy-mm.reversed.domain:identifier`
+- Maximum length: 223 characters
+- The `yyyy-mm` date field must be a valid year and month
+
+**Example:**
+```go
+if err := validation.ValidateIQN("iqn.2024-12.com.atlas:storage.target1"); err != nil {
+    // Handle error
+}
+```
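+
+A minimal sketch of the IQN checks, assuming a regular expression is used for the overall shape and that the `:identifier` suffix is required as shown in the format above. `validateIQN` and `iqnRe` are hypothetical names; the real validator may be stricter or more lenient.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "regexp"
+    "strconv"
+)
+
+// iqnRe is an assumed pattern for iqn.yyyy-mm.reversed.domain:identifier.
+var iqnRe = regexp.MustCompile(`^iqn\.(\d{4})-(\d{2})\.[A-Za-z0-9][A-Za-z0-9.-]*:.+$`)
+
+// validateIQN is a hypothetical stand-in for validation.ValidateIQN.
+func validateIQN(iqn string) error {
+    if len(iqn) > 223 {
+        return errors.New("IQN must be at most 223 characters")
+    }
+    m := iqnRe.FindStringSubmatch(iqn)
+    if m == nil {
+        return errors.New("IQN must look like iqn.yyyy-mm.reversed.domain:identifier")
+    }
+    // m[2] holds exactly two digits, so Atoi cannot fail here.
+    month, _ := strconv.Atoi(m[2])
+    if month < 1 || month > 12 {
+        return fmt.Errorf("IQN month %q is out of range", m[2])
+    }
+    return nil
+}
+
+func main() {
+    fmt.Println(validateIQN("iqn.2024-12.com.atlas:storage.target1")) // valid, prints <nil>
+    fmt.Println(validateIQN("iqn.2024-13.com.atlas:storage.target1")) // rejected: month 13
+}
+```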
+
+### Size Strings
+
+**Rules:**
+- Format: a number followed by an optional unit (K, M, G, T, P); a bare number is a byte count
+- Units: K (kilobytes), M (megabytes), G (gigabytes), T (terabytes), P (petabytes)
+- Case insensitive
+
+**Accepted values:**
+- `"10"` - 10 bytes
+- `"10K"` - 10 kilobytes
+- `"1G"` - 1 gigabyte
+- `"2T"` - 2 terabytes
+
+**Example:**
+```go
+if err := validation.ValidateSize("10G"); err != nil {
+    // Handle error
+}
+```
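+
+To illustrate how a size string could be turned into a byte count, here is a hedged sketch. `parseSize` is a hypothetical helper, not part of the documented API, and it makes two assumptions the text above does not state: units are powers of 1024, and only whole numbers are accepted.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "strconv"
+    "strings"
+)
+
+// parseSize is a hypothetical helper that converts a size string to bytes.
+// Assumptions: K, M, G, T, P are powers of 1024; fractions are not accepted.
+func parseSize(s string) (uint64, error) {
+    s = strings.ToUpper(strings.TrimSpace(s))
+    if s == "" {
+        return 0, errors.New("size cannot be empty")
+    }
+    shift := uint(0)
+    // The position of the unit in "KMGTP" gives the power of 1024.
+    if i := strings.IndexByte("KMGTP", s[len(s)-1]); i >= 0 {
+        shift = uint(10 * (i + 1))
+        s = s[:len(s)-1]
+    }
+    n, err := strconv.ParseUint(s, 10, 64)
+    if err != nil {
+        return 0, fmt.Errorf("invalid size number %q", s)
+    }
+    if n > ^uint64(0)>>shift {
+        return 0, errors.New("size does not fit in 64 bits")
+    }
+    return n << shift, nil
+}
+
+func main() {
+    for _, in := range []string{"10", "10k", "1G", "2T", "ten"} {
+        n, err := parseSize(in)
+        fmt.Println(in, n, err)
+    }
+}
+```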
+
+### Filesystem Paths
+
+**Rules:**
+- Must be absolute (start with `/`)
+- Maximum length: 4096 characters
+- Cannot contain `..` (path traversal)
+- Cannot contain `//` (double slashes)
+- Cannot contain null bytes
+
+**Example:**
+```go
+if err := validation.ValidatePath("/tank/data"); err != nil {
+    // Handle error
+}
+```
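+
+The path rules map onto a handful of string checks. The sketch below is illustrative only; `validatePath` is a hypothetical name and the error wording is invented for the example.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "strings"
+)
+
+// validatePath is a hypothetical stand-in for validation.ValidatePath.
+func validatePath(p string) error {
+    switch {
+    case !strings.HasPrefix(p, "/"):
+        // Also rejects the empty string.
+        return errors.New("path must be absolute")
+    case len(p) > 4096:
+        return errors.New("path must be at most 4096 characters")
+    case strings.Contains(p, "\x00"):
+        return errors.New("path cannot contain null bytes")
+    case strings.Contains(p, ".."):
+        return errors.New("path cannot contain '..'")
+    case strings.Contains(p, "//"):
+        return errors.New("path cannot contain '//'")
+    }
+    return nil
+}
+
+func main() {
+    fmt.Println(validatePath("/tank/data"))      // valid, prints <nil>
+    fmt.Println(validatePath("/tank/../etc"))    // rejected: path traversal
+    fmt.Println(validatePath("relative/folder")) // rejected: not absolute
+}
+```
+
+Note that a plain substring check for `..` also rejects harmless names such as `/tank/a..b`; that is the behaviour the rule above describes, trading some flexibility for simplicity.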
+
+### CIDR/Hostname (NFS Clients)
+
+**Rules:**
+- Can be wildcard: `*`
+- Can be CIDR notation: `192.168.1.0/24`
+- Can be hostname: `server.example.com`
+- Hostname must follow DNS rules
+
+**Example:**
+```go
+if err := validation.ValidateCIDR("192.168.1.0/24"); err != nil {
+    // Handle error
+}
+```
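+
+The three accepted forms could be distinguished roughly as follows. This is a sketch under assumptions: `validateClient` and `labelRe` are hypothetical names, CIDR parsing is delegated to the standard library's `net.ParseCIDR`, and "DNS rules" is read as labels of 1 to 63 alphanumeric or hyphen characters that do not start or end with a hyphen, with a 253-character total limit.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "net"
+    "regexp"
+    "strings"
+)
+
+// labelRe matches one DNS label under the assumptions stated above.
+var labelRe = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$`)
+
+// validateClient is a hypothetical stand-in for validation.ValidateCIDR.
+func validateClient(s string) error {
+    if s == "*" {
+        return nil // wildcard
+    }
+    if strings.Contains(s, "/") {
+        if _, _, err := net.ParseCIDR(s); err != nil {
+            return fmt.Errorf("invalid CIDR %q", s)
+        }
+        return nil
+    }
+    if len(s) == 0 || len(s) > 253 {
+        return errors.New("hostname must be 1 to 253 characters")
+    }
+    for _, label := range strings.Split(s, ".") {
+        if !labelRe.MatchString(label) {
+            return fmt.Errorf("invalid hostname label %q", label)
+        }
+    }
+    return nil
+}
+
+func main() {
+    for _, in := range []string{"*", "192.168.1.0/24", "server.example.com", "192.168.1.0/33"} {
+        fmt.Println(in, validateClient(in))
+    }
+}
+```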
+
+## Sanitization
+
+### String Sanitization
+
+Strips potentially dangerous characters and trims the result:
+- Null bytes (`\x00`)
+- Control characters (ASCII codes below 32; the space character, code 32, is kept)
+- Leading and trailing whitespace
+
+**Example:**
+```go
+clean := validation.SanitizeString(userInput)
+```
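+
+A minimal sketch of this kind of sanitizer, assuming that every character below ASCII 32 (including tab and newline) is dropped before the result is trimmed. `sanitizeString` is a hypothetical name used for illustration.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// sanitizeString is a hypothetical stand-in for validation.SanitizeString.
+func sanitizeString(s string) string {
+    cleaned := strings.Map(func(r rune) rune {
+        if r < 32 {
+            return -1 // a negative value tells strings.Map to drop the rune
+        }
+        return r
+    }, s)
+    return strings.TrimSpace(cleaned)
+}
+
+func main() {
+    fmt.Printf("%q\n", sanitizeString("  ad\x00min\t\n ")) // prints "admin"
+}
+```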
+
+### Path Sanitization
+
+Normalizes filesystem paths:
+- Removes leading/trailing whitespace
+- Normalizes slashes (backslash to forward slash)
+- Removes multiple consecutive slashes
+
+**Example:**
+```go
+cleanPath := validation.SanitizePath("/tank//data")
+// Result: "/tank/data"
+```
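+
+The normalization steps could look like the sketch below. `sanitizePath` is a hypothetical name. The list above does not say how a trailing slash is handled; this sketch drops it (except for the root path), which is an assumption.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// sanitizePath is a hypothetical stand-in for validation.SanitizePath.
+func sanitizePath(p string) string {
+    p = strings.TrimSpace(p)
+    p = strings.ReplaceAll(p, `\`, "/")
+    // Repeat until no run of slashes remains, since one pass over "///"
+    // still leaves "//".
+    for strings.Contains(p, "//") {
+        p = strings.ReplaceAll(p, "//", "/")
+    }
+    // Assumption: drop a trailing slash, but keep the root path "/" intact.
+    if len(p) > 1 {
+        p = strings.TrimSuffix(p, "/")
+    }
+    return p
+}
+
+func main() {
+    fmt.Println(sanitizePath(`  \tank\\data  `)) // prints /tank/data
+    fmt.Println(sanitizePath("///"))             // prints /
+}
+```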
+
+## Integration
+
+### In API Handlers
+
+Validation is integrated into all create/update handlers:
+
+```go
+func (a *App) handleCreatePool(w http.ResponseWriter, r *http.Request) {
+    // ... decode request ...
+    
+    // Validate pool name
+    if err := validation.ValidateZFSName(req.Name); err != nil {
+        writeError(w, errors.ErrValidation(err.Error()))
+        return
+    }
+    
+    // ... continue with creation ...
+}
+```
+
+### Error Responses
+
+Validation errors return structured error responses:
+
+```json
+{
+  "code": "VALIDATION_ERROR",
+  "message": "validation error on field 'name': name cannot be empty",
+  "details": ""
+}
+```
+
+## Security Benefits
+
+1. **Injection Prevention**: Validated inputs help prevent command injection
+2. **Path Traversal Protection**: Path validation prevents directory traversal
+3. **Data Integrity**: Ensures data conforms to expected formats
+4. **System Stability**: Prevents invalid operations that could crash services
+5. **User Experience**: Clear error messages guide users to correct input
+
+## Best Practices
+
+1. **Validate Early**: Validate inputs as soon as they're received
+2. **Sanitize Before Storage**: Sanitize strings before storing them in the database
+3. **Validate Format**: Check format before parsing (e.g., size strings)
+4. **Check Length**: Enforce maximum lengths to prevent DoS
+5. **Whitelist Characters**: Only allow known-safe characters
+
+## Future Enhancements
+
+1. **Custom Validators**: Domain-specific validation rules
+2. **Validation Middleware**: Automatic validation for all endpoints
+3. **Schema Validation**: JSON schema validation
+4. **Rate Limiting**: Prevent abuse through validation
+5. **Input Normalization**: Automatic normalization of valid inputs
--- a/docs/openapi.yaml
+++ b/docs/openapi.yaml