atlas/docs/ERROR_HANDLING.md

# Error Handling & Recovery

## Overview

AtlasOS combines structured error responses, graceful degradation, and automatic recovery mechanisms to keep the system reliable and the user experience predictable.

## Error Types

### Structured API Errors

All API errors follow a consistent structure:

```json
{
  "code": "NOT_FOUND",
  "message": "dataset not found",
  "details": "tank/missing"
}
```

### Error Codes

- `INTERNAL_ERROR` - Unexpected server errors (500)
- `NOT_FOUND` - Resource not found (404)
- `BAD_REQUEST` - Invalid request parameters (400)
- `CONFLICT` - Resource conflict (409)
- `UNAUTHORIZED` - Authentication required (401)
- `FORBIDDEN` - Insufficient permissions (403)
- `SERVICE_UNAVAILABLE` - Service temporarily unavailable (503)
- `VALIDATION_ERROR` - Input validation failed (400)
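
The structure and code list above could be modeled in Go roughly as follows. This is a minimal sketch, not the actual AtlasOS `errors` package: the JSON field names and code strings come from this document, while `statusByCode`, `HTTPStatus`, and the method bodies are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// APIError mirrors the JSON structure shown above.
// The exact field layout in AtlasOS is an assumption.
type APIError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
	Details string `json:"details,omitempty"`
}

// Error implements the error interface.
func (e *APIError) Error() string {
	if e.Details != "" {
		return fmt.Sprintf("%s: %s (%s)", e.Code, e.Message, e.Details)
	}
	return fmt.Sprintf("%s: %s", e.Code, e.Message)
}

// WithDetails returns a copy with Details set, leaving the receiver unchanged.
func (e *APIError) WithDetails(details string) *APIError {
	c := *e
	c.Details = details
	return &c
}

// statusByCode (hypothetical name) encodes the code-to-status list above.
var statusByCode = map[string]int{
	"INTERNAL_ERROR":      http.StatusInternalServerError,
	"NOT_FOUND":           http.StatusNotFound,
	"BAD_REQUEST":         http.StatusBadRequest,
	"CONFLICT":            http.StatusConflict,
	"UNAUTHORIZED":        http.StatusUnauthorized,
	"FORBIDDEN":           http.StatusForbidden,
	"SERVICE_UNAVAILABLE": http.StatusServiceUnavailable,
	"VALIDATION_ERROR":    http.StatusBadRequest,
}

// HTTPStatus returns the status for a code and falls back to 500
// for codes that are not in the table.
func HTTPStatus(code string) int {
	if s, ok := statusByCode[code]; ok {
		return s
	}
	return http.StatusInternalServerError
}

// ErrNotFound builds a NOT_FOUND error for the named resource type.
func ErrNotFound(resource string) *APIError {
	return &APIError{Code: "NOT_FOUND", Message: resource + " not found"}
}

func main() {
	err := ErrNotFound("dataset").WithDetails("tank/missing")
	fmt.Println(HTTPStatus(err.Code), err)
}
```

Returning a copy from `WithDetails` is a design choice in this sketch: it lets a shared base error be reused across requests without one request's details leaking into another.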

## Error Handling Patterns

### 1. Structured Error Responses

All errors use the `errors.APIError` type for consistent formatting:

```go
if resource == nil {
    writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
    return
}
```

### 2. Graceful Degradation

Service operations (SMB/NFS/iSCSI) use graceful degradation:

- **Desired State Stored**: Configuration is always stored in the store
- **Service Application**: Service configuration is applied asynchronously
- **Non-Blocking**: Service failures don't fail API requests
- **Retry Ready**: Failed operations can be retried later

Example:
```go
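// Store the configuration; a store failure is the only step that fails the request
share, err := a.smbStore.Create(...)
if err != nil {
    writeError(w, errors.ErrInternal("failed to create SMB share").WithDetails(err.Error()))
    return
}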

// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
    // Log but don't fail - desired state is stored
    log.Printf("SMB service configuration failed (non-fatal): %v", err)
}
```

### 3. Panic Recovery

All HTTP handlers are wrapped with panic recovery middleware:

```go
func (a *App) errorMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer recoverPanic(w, r)
        next.ServeHTTP(w, r)
    })
}
```

Panics are caught and converted to proper error responses instead of crashing the server.

### 4. Atomic Operations with Rollback

Service configuration operations are atomic with automatic rollback:

1. **Write to temporary file** (`*.atlas.tmp`)
2. **Back up existing config** (`.backup`)
3. **Atomically replace** config file
4. **Reload service**
5. **On failure**: Automatically restore backup

Example (SMB):
```go
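// Write the new config to a temp file next to the target
if err := os.WriteFile(tmpPath, config, 0644); err != nil {
    return err
}

// Back up the existing config by copying it, so configPath stays in place
old, err := os.ReadFile(configPath)
if err != nil {
    return err
}
if err := os.WriteFile(backupPath, old, 0644); err != nil {
    return err
}

// Atomic replace (rename is atomic when both paths are on the same filesystem)
if err := os.Rename(tmpPath, configPath); err != nil {
    return err
}

// Reload service
if err := reloadService(); err != nil {
    // Restore backup automatically; report a failed restore as well
    if restoreErr := os.Rename(backupPath, configPath); restoreErr != nil {
        return fmt.Errorf("reload failed: %v; restore failed: %w", err, restoreErr)
    }
    return err
}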
```

## Retry Mechanisms

### Retry Configuration

The `errors.Retry` function provides configurable retry logic:

```go
config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
    return serviceOperation()
}, config)
```

### Default Retry Behavior

- **Max Attempts**: 3
- **Backoff**: Exponential (100ms, 200ms, 400ms)
- **Use Case**: Transient failures (network, temporary service unavailability)

## Error Recovery

### Service Configuration Recovery

When service configuration fails:

1. **Configuration is stored** (desired state preserved)
2. **Error is logged** (for debugging)
3. **Operation continues** (API request succeeds)
4. **Retry available** (manually via the API; automatic background retry is a planned enhancement)

### Database Recovery

- **Connection failures**: Logged and retried
- **Transaction failures**: Rolled back automatically
- **Schema errors**: Detected during migration

### ZFS Operation Recovery

- **Command failures**: Returned as errors to caller
- **Partial failures**: State is preserved, operation can be retried
- **Validation**: Performed before destructive operations

## Error Logging

All errors are logged with context:

```go
log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)
```

Error logs include:
- Error message
- Operation context
- Resource identifiers
- Timestamp (via standard log)

## Best Practices

### 1. Always Use Structured Errors

```go
// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))

// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})
```

### 2. Handle Service Errors Gracefully

```go
// Good - graceful degradation
if err := service.Apply(); err != nil {
    log.Printf("service error (non-fatal): %v", err)
    // Continue - desired state is stored
}

// Avoid - failing the request
if err := service.Apply(); err != nil {
    return err // Don't fail the whole request
}
```

### 3. Validate Before Operations

```go
// Good - validate first
if !datasetExists {
    writeError(w, errors.ErrNotFound("dataset"))
    return
}
// Then perform operation
```

### 4. Use Context for Error Details

```go
// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))

// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))
```

## Error Response Format

All error responses follow this structure:

```json
{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": "Additional context (optional)"
}
```

HTTP status codes match error types:
- `400` - Bad Request / Validation Error
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `409` - Conflict
- `500` - Internal Error
- `503` - Service Unavailable

## Future Enhancements

1. **Error Tracking**: Centralized error tracking and alerting
2. **Automatic Retry Queue**: Background retry for failed operations
3. **Error Metrics**: Track error rates by type and endpoint
4. **User-Friendly Messages**: More descriptive error messages
5. **Error Correlation**: Link related errors for debugging